In this report, I seek to determine whether there is a significant difference in income between men and women, and, if so, whether the difference observed varies depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.).
To address this question, I use data collected from the National Longitudinal Survey of Youth, 1979 cohort (NLSY79). “The 1979 Cohort,” according to the project’s webpage, “is a longitudinal study that follows the lives of a sample of American youth born between 1957-64. The cohort originally included 12,686 respondents ages 14-22 when first interviewed in 1979; after two subsamples were dropped, 9,964 respondents remain in the eligible samples” (see https://www.nlsinfo.org/content/cohorts/nlsy79 for more information). To support the analysis that follows, I was provided with a base data set containing just 70 of the tens of thousands of variables included in the original data set. This base data set can be accessed in its original form on the Programming R for Analytics course website (http://www.andrew.cmu.edu/user/achoulde/94842/), along with accompanying files describing each of the variables used.
# Break down income by gender
nlsy.table.gender <- nlsy %>%
group_by(gender) %>%
summarize(mean = mean(income, na.rm = TRUE),
lower = t.test(income)$conf.int[1],
upper = t.test(income)$conf.int[2])
kable(nlsy.table.gender)
| gender | mean | lower | upper |
|---|---|---|---|
| Male | 53445.91 | 51115.77 | 55776.05 |
| Female | 29538.51 | 28386.75 | 30690.27 |
## Warning: Ignoring 5662 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning: Removed 5662 rows containing non-finite values (stat_ydensity).
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 53446.
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 29539.
# Calculate wage gap (difference in income between men and women)
nlsy.table.gender[1,2] - nlsy.table.gender[2,2] # absolute
## mean
## 1 23907.4
## mean
## 1 44.73
The table above shows clearly that there is a substantial difference in income between male and female respondents in the NLSY79 cohort. While male respondents reported an average income of $53,445.91, female respondents reported an average income of only $29,538.51. In other words, female respondents reported average earnings $23,907.40, or 44.73%, lower than those of their male counterparts.
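The percentage figure can be read two ways depending on the denominator. As a quick check, the sketch below recomputes the gap from the two means reported in the table (the variable names are illustrative and not part of the original analysis). It shows that 44.73% is the gap expressed as a share of the male mean, while the same dollar gap expressed as a share of the female mean is considerably larger.

```r
# Means reported in the table above (rounded to cents)
mean_male   <- 53445.91
mean_female <- 29538.51

# Absolute gap in dollars
gap_abs <- mean_male - mean_female

# Relative gap, using each group's mean as the baseline
gap_pct_of_male   <- round(100 * gap_abs / mean_male, 2)
gap_pct_of_female <- round(100 * gap_abs / mean_female, 2)

print(c(absolute      = gap_abs,
        pct_of_male   = gap_pct_of_male,
        pct_of_female = gap_pct_of_female))
```

Stating the baseline explicitly ("women earned 44.73% less than men" versus "men earned about 81% more than women") avoids ambiguity when the figure is quoted elsewhere.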
The distribution of incomes between male and female respondents is seen even more clearly in the boxplot above. This graph shows that men not only have higher median incomes compared to women, but also that they have greater variability in income than do women, especially in the higher income ranges (3rd and 4th quartile ranges).
The violin plot shows more clearly the varying proportion of males and females falling into each portion of the range for income. While both men and women have the highest concentrations of their respective populations in the extreme low range of the distribution, a much higher proportion of men fall into the above $30,000 range than do women, who are pretty densely packed around the $25,000 mark.
# Graph wage by race
wage.race <- nlsy %>%
group_by(race) %>%
summarize(count = n(), income = mean(income, na.rm = TRUE))
# Table
wage.race
## # A tibble: 3 x 3
## race count income
## <fct> <int> <dbl>
## 1 Other 7510 50839.
## 2 Black 3174 28325.
## 3 Hispanic 2002 36554.
# Plot
wage.race.obj <- ggplot(data = wage.race, aes(y = income, x = race, fill = race)) +
theme(legend.title = element_blank())
wage.race.obj + geom_bar(stat = "identity", fill = I("steelblue")) +
xlab("Race") +
ylab("Average income") +
ggtitle("Average income by race")
# Calculate proportional representation of genders per race category
prop.race <- prop.table(table(nlsy$race, nlsy$gender), margin = 2)
prop.race.per <- round(prop.race* 100, 2)
prop.race.per
##
## Male Female
## Other 59.19 59.21
## Black 25.19 24.84
## Hispanic 15.62 15.95
The graph above shows that income is strongly associated with race. Specifically, Other (non-Black, non-Hispanic) respondents earned an average income of $50,838.84, compared to $36,554.36 for Hispanic respondents and approximately $28,325 for Black respondents. If a disproportionate number of female respondents were also women of color, then race might be acting as a confounder in our estimates of the effect of gender on income. However, when we calculate the proportions of male and female respondents per race category, we see roughly equal proportions across all races. Race is therefore unlikely to be driving the differences we observe in income between male and female respondents.
The next analysis examines whether the wage gap observed between men and women also varies by race.
# Graph wage gap by race
wage.gap.race <- nlsy %>%
group_by(race) %>%
summarize(income.gap = mean(income[gender == "Male"], na.rm = TRUE) -
mean(income[gender == "Female"], na.rm = TRUE),
lower = t.test(income ~ gender)$conf.int[1],
upper = t.test(income ~ gender)$conf.int[2],
is.significant = as.numeric(t.test(income ~ gender)$p.value < 0.05))
# Re-order the race factor according to gap size
wage.gap.race <- mutate(wage.gap.race, race = reorder(race, income.gap))
# Plot, with error bars
ggplot(data = wage.gap.race, aes(x = race, y = income.gap, fill = is.significant)) +
geom_bar(stat = "identity") +
xlab("Race") +
ylab("Income gap ($)") +
ggtitle("Income gap between men and women, by race") +
guides(fill = "none") +
geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
theme(text = element_text(size=12))
The graph above shows that the wage gap between men and women holds across all races measured. In each case, the difference is statistically significant, as indicated by the fact that the error bars do not cross the 0 line. Moreover, the wage gap is greatest among non-Black, non-Hispanic respondents (Other), followed by Hispanic, and finally Black.
The proportions table shows that men and women are represented equally within each of the race categories. This suggests that race probably isn’t driving the wage disparity we observe between men and women; i.e., women are not merely appearing to earn less than men because they are more strongly represented in the disadvantaged race categories. However, the fact that the wage disparity is greater in certain categories than in others indicates that a woman’s race does influence the magnitude of the income disparity she is likely to experience relative to her male counterparts.
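The per-group gap, its confidence interval, and the significance flag can all be taken from a single Welch t-test per group. The following is a minimal sketch of that idea in base R, run on synthetic data because the NLSY file is not bundled here; the data frame `toy`, its group labels, and the helper `gap_by_group` are hypothetical names introduced only for illustration.

```r
# Synthetic stand-in for the survey data (NOT the NLSY79 values)
set.seed(1)
toy <- data.frame(
  race   = rep(c("A", "B"), each = 200),
  gender = factor(rep(c("Male", "Female"), times = 200),
                  levels = c("Male", "Female"))
)
toy$income <- 30000 + 10000 * (toy$gender == "Male") + rnorm(400, sd = 8000)

# One t-test per group; with levels c("Male", "Female") the estimate and
# confidence interval refer to mean(Male) - mean(Female)
gap_by_group <- do.call(rbind, lapply(split(toy, toy$race), function(d) {
  tt <- t.test(income ~ gender, data = d)
  data.frame(
    race           = d$race[1],
    income.gap     = unname(tt$estimate[1] - tt$estimate[2]),
    lower          = tt$conf.int[1],  # first element is the lower bound
    upper          = tt$conf.int[2],  # second element is the upper bound
    is.significant = tt$p.value < 0.05
  )
}))

print(gap_by_group)
```

Running the test once per group keeps the point estimate, interval, and p-value mutually consistent, and makes it harder to mislabel the two ends of `conf.int`.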
The next set of tables and graphs explore whether the wage gap between men and women might be due, in part, to differences in professional qualification and/or occupational choices. First, I look at whether male and female respondents are systematically choosing to work in different industries, and whether these choices may help to explain the difference in income between them.
# Alternative hypothesis 1 tables and graphs - professional qualifications and occupational choice
# Industry tables and graphs
# Analyze mean income by industry
nlsy.indinc1 <- nlsy %>%
group_by(industry) %>%
summarize(count = n(), income = mean(income, na.rm = TRUE))
## Warning: Factor `industry` contains implicit NA, consider using
## `forcats::fct_explicit_na`
| industry | count | income |
|---|---|---|
| Health Care and Social Assistance | 994 | 38733.419 |
| Agriculture, Forestry, Fishing, and Hunting | 76 | 32112.960 |
| Mining | 37 | 70624.444 |
| Utilities | 75 | 77828.750 |
| Construction | 493 | 33510.458 |
| Manufacturing | 774 | 51756.430 |
| Wholesale Trade | 178 | 58846.483 |
| Retail Trade | 566 | 31745.682 |
| Transportation and Warehousing | 392 | 49442.971 |
| Information | 140 | 64085.781 |
| Finance and Insurance | 270 | 74106.454 |
| Real Estate and Rental and Leasing | 115 | 41470.730 |
| Professional, Scientific, and Technical Services | 304 | 81616.629 |
| Management, Administrative and Support, and Waste Management Services | 402 | 26233.206 |
| Educational Services | 625 | 41214.819 |
| Arts, Entertainment, and Recreation | 106 | 32663.576 |
| Accomodations and Food Services | 308 | 23457.610 |
| Other Services (Except Public Administration | 333 | 29123.608 |
| Public Administration and Active Duty Military | 441 | 53055.386 |
| Armed Forces | 12 | 43090.909 |
| Not in Labor Force | 1 | 0.000 |
| Uncodeable | 37 | 37223.486 |
| NA | 6007 | 9680.487 |
# Bar chart
nlsy.indinc.obj1 <- ggplot(data = nlsy.indinc1, aes(y = income, x = industry, fill = I("steelblue")))
nlsy.indinc.obj1 + geom_bar(stat = "identity", position = position_dodge(preserve = "single")) +
xlab("Industry") +
ylab("Average income ($)") +
ggtitle("Average income by industry") +
coord_flip(xlim = NULL, ylim = NULL)
# Analyze mean income by industry and gender
nlsy.indinc2 <- nlsy %>%
group_by(gender, industry) %>%
summarize(income = mean(income, na.rm = TRUE))
## Warning: Factor `industry` contains implicit NA, consider using
## `forcats::fct_explicit_na`
| gender | industry | income |
|---|---|---|
| Male | Health Care and Social Assistance | 72823.775 |
| Male | Agriculture, Forestry, Fishing, and Hunting | 36664.117 |
| Male | Mining | 72252.290 |
| Male | Utilities | 83860.179 |
| Male | Construction | 33772.281 |
| Male | Manufacturing | 60809.464 |
| Male | Wholesale Trade | 68140.036 |
| Male | Retail Trade | 47518.280 |
| Male | Transportation and Warehousing | 53169.996 |
| Male | Information | 85020.792 |
| Male | Finance and Insurance | 134255.172 |
| Male | Real Estate and Rental and Leasing | 40185.222 |
| Male | Professional, Scientific, and Technical Services | 115337.945 |
| Male | Management, Administrative and Support, and Waste Management Services | 29635.603 |
| Male | Educational Services | 55653.855 |
| Male | Arts, Entertainment, and Recreation | 36357.629 |
| Male | Accomodations and Food Services | 35760.416 |
| Male | Other Services (Except Public Administration | 41605.676 |
| Male | Public Administration and Active Duty Military | 64865.785 |
| Male | Armed Forces | 41500.000 |
| Male | Uncodeable | 33815.368 |
| Male | NA | 15467.761 |
| Female | Health Care and Social Assistance | 31143.160 |
| Female | Agriculture, Forestry, Fishing, and Hunting | 13908.333 |
| Female | Mining | 60531.800 |
| Female | Utilities | 56718.750 |
| Female | Construction | 31040.217 |
| Female | Manufacturing | 32504.409 |
| Female | Wholesale Trade | 41498.517 |
| Female | Retail Trade | 19137.980 |
| Female | Transportation and Warehousing | 40951.835 |
| Female | Information | 37219.183 |
| Female | Finance and Insurance | 44203.949 |
| Female | Real Estate and Rental and Leasing | 43157.958 |
| Female | Professional, Scientific, and Technical Services | 48800.584 |
| Female | Management, Administrative and Support, and Waste Management Services | 21206.439 |
| Female | Educational Services | 36623.458 |
| Female | Arts, Entertainment, and Recreation | 26473.541 |
| Female | Accomodations and Food Services | 14137.303 |
| Female | Other Services (Except Public Administration | 18320.415 |
| Female | Public Administration and Active Duty Military | 43747.191 |
| Female | Armed Forces | 47333.333 |
| Female | Not in Labor Force | 0.000 |
| Female | Uncodeable | 41270.625 |
| Female | NA | 5957.862 |
# Bar chart
nlsy.indinc.obj2 <- ggplot(data = nlsy.indinc2, aes(y = income, x = industry, fill = gender))
nlsy.indinc.obj2 + geom_bar(stat = "identity", position = 'dodge') +
xlab("Industry") +
ylab("Average income ($)") +
ggtitle("Average income by industry,\nby gender") +
coord_flip(xlim = NULL, ylim = NULL) +
theme(axis.text.x = element_text(angle=60, hjust=1))
The first bar chart above shows that the highest-paying industries, on average, include Finance and Insurance; Professional, Scientific, and Technical Services; Information; and Utilities, while the lowest-paying industries include Management, Administrative and Support, and Waste Management Services; Construction; and Accommodations and Food Services.
A side-by-side comparison of average income across industries by gender shows a slight-to-substantial advantage for men across most industries. This analysis suggests that the wage gap between men and women is not due to differences in choice of industry between the two groups, insofar as men tend to outearn women independently of what industry they’re in. The few exceptions are in the areas of Real Estate and Rental and Leasing and the Armed Forces, where women slightly outearn men on average.
Even in those areas where women are more strongly represented, such as Health Care and Social Assistance and Educational Services, men still tend to earn more on average (see next section’s analysis).
# Calculate proportional representation of genders per industry
prop.ind <- prop.table(table(nlsy$industry, nlsy$gender), margin = 2)
prop.ind.per <- round(prop.ind* 100, 2)
kable(prop.ind.per)
| | Male | Female |
|---|---|---|
| Health Care and Social Assistance | 5.46 | 23.98 |
| Agriculture, Forestry, Fishing, and Hunting | 1.86 | 0.44 |
| Mining | 0.98 | 0.15 |
| Utilities | 1.77 | 0.50 |
| Construction | 13.60 | 1.38 |
| Manufacturing | 15.76 | 7.56 |
| Wholesale Trade | 3.57 | 1.79 |
| Retail Trade | 7.59 | 9.33 |
| Transportation and Warehousing | 8.35 | 3.47 |
| Information | 2.38 | 1.82 |
| Finance and Insurance | 2.80 | 5.24 |
| Real Estate and Rental and Leasing | 2.01 | 1.44 |
| Professional, Scientific, and Technical Services | 4.54 | 4.56 |
| Management, Administrative and Support, and Waste Management Services | 7.29 | 4.80 |
| Educational Services | 4.51 | 14.03 |
| Arts, Entertainment, and Recreation | 1.98 | 1.21 |
| Accomodations and Food Services | 3.99 | 5.21 |
| Other Services (Except Public Administration | 4.73 | 5.24 |
| Public Administration and Active Duty Military | 5.95 | 7.24 |
| Armed Forces | 0.27 | 0.09 |
| Not in Labor Force | 0.00 | 0.03 |
| Uncodeable | 0.61 | 0.50 |
# Graph proportional representation of genders per industry
prop.ind.df <- as.data.frame(prop.ind)
prop.ind.obj <- ggplot(data = prop.ind.df, aes(y = Freq, x = Var1, fill = Var2)) +
theme(legend.title = element_blank())
prop.ind.obj + geom_bar(stat = "identity", position = position_dodge(preserve = "single")) +
xlab("Industry") +
ylab("Proportion") +
ggtitle("Proportional representation \nper industry, by gender") +
coord_flip(xlim = NULL, ylim = NULL)
# Graph difference in proportions
prop.ind.per.df <- as.data.frame(prop.ind.per)
prop.ind.diff <- prop.ind.per.df %>%
group_by(Var1) %>%
summarize(prop.gap = Freq[Var2 == "Male"] - Freq[Var2 == "Female"])
prop.ind.diff
## # A tibble: 22 x 2
## Var1 prop.gap
## <fct> <dbl>
## 1 Health Care and Social Assistance -18.5
## 2 Agriculture, Forestry, Fishing, and Hunting 1.42
## 3 Mining 0.83
## 4 Utilities 1.27
## 5 Construction 12.2
## 6 Manufacturing 8.2
## 7 Wholesale Trade 1.78
## 8 Retail Trade -1.74
## 9 Transportation and Warehousing 4.88
## 10 Information 0.560
## # ... with 12 more rows
prop.ind.obj2 <- ggplot(data = prop.ind.diff, aes(y = prop.gap, x = Var1, fill = I("steelblue"))) +
theme(legend.title = element_blank())
prop.ind.obj2 + geom_bar(stat = "identity") +
xlab("Industry") +
ylab("Percentage difference") +
ggtitle("Proportional representation \nper industry, male - female") +
coord_flip(xlim = NULL, ylim = NULL)
The side-by-side bar chart above shows that the representation gap between men and women varies across industries, with women being more strongly represented in industries such as Health Care and Social Assistance and Educational Services, and men being more strongly represented in Construction and Manufacturing. If the industries in which men are more strongly represented also tended to pay higher salaries on average, then this disparity might partially explain the wage gap we observe between men and women. If there is no such correlation, however, then this would not be a likely explanation for the gap we observe.
Referring back to the previous section’s analysis, we find no such correlation between high-paying industries and male representation. Men predominate in two of the three low-paying industries noted above (Construction and Management, Administrative and Support, and Waste Management Services), while women predominate in Finance and Insurance, are equally represented in Professional, Scientific, and Technical Services, and are only slightly underrepresented in Information and Utilities. While this analysis doesn’t eliminate the possibility that people’s choice of industry contributes to the wage gap we observe between men and women, it does somewhat weaken the case in favor of that explanation. More clarity could be gained into this relationship by performing a more granular analysis of income by profession using the occupation variable from our base data set. I will not pursue that analysis in this report, however.
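One way to make this comparison less dependent on eyeballing the charts is to compute a rank correlation between each industry’s average income and its male - female representation gap. The sketch below does so using the values from the two tables above, with industry names abbreviated; it is an illustrative check rather than part of the original analysis. It assumes that dropping the "Not in Labor Force", "Uncodeable", and NA rows is acceptable, and it weights every industry equally regardless of size.

```r
# Values copied from the income-by-industry and representation tables above
ind <- data.frame(
  industry = c("Health Care", "Agriculture", "Mining", "Utilities",
               "Construction", "Manufacturing", "Wholesale Trade",
               "Retail Trade", "Transportation", "Information",
               "Finance and Insurance", "Real Estate", "Professional",
               "Management/Admin", "Educational Services", "Arts",
               "Accommodations", "Other Services", "Public Administration",
               "Armed Forces"),
  income = c(38733.419, 32112.960, 70624.444, 77828.750, 33510.458,
             51756.430, 58846.483, 31745.682, 49442.971, 64085.781,
             74106.454, 41470.730, 81616.629, 26233.206, 41214.819,
             32663.576, 23457.610, 29123.608, 53055.386, 43090.909),
  male_pct = c(5.46, 1.86, 0.98, 1.77, 13.60, 15.76, 3.57, 7.59, 8.35, 2.38,
               2.80, 2.01, 4.54, 7.29, 4.51, 1.98, 3.99, 4.73, 5.95, 0.27),
  female_pct = c(23.98, 0.44, 0.15, 0.50, 1.38, 7.56, 1.79, 9.33, 3.47, 1.82,
                 5.24, 1.44, 4.56, 4.80, 14.03, 1.21, 5.21, 5.24, 7.24, 0.09)
)

# Positive values mean men are over-represented in that industry
ind$prop_gap <- ind$male_pct - ind$female_pct

# Spearman rank correlation between pay level and male over-representation
rho <- cor(ind$income, ind$prop_gap, method = "spearman")
print(round(rho, 2))
```

A coefficient close to zero would be consistent with the reading above, whereas a clearly positive value would suggest that men cluster in better-paid industries.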
In the final section of my analysis of the relationship between income and industry, I compare the wage gap within each industry to further validate the results of the previous sections’ analyses.
# Calculate wage gap by industry
wage.gap.ind <- nlsy %>%
group_by(industry) %>%
summarize(male = mean(income[gender == "Male"], na.rm = TRUE),
female = mean(income[gender == "Female"], na.rm = TRUE),
income.gap = mean(income[gender == "Male"], na.rm = TRUE) -
mean(income[gender == "Female"], na.rm = TRUE))
## Warning: Factor `industry` contains implicit NA, consider using
## `forcats::fct_explicit_na`
| industry | male | female | income.gap |
|---|---|---|---|
| Health Care and Social Assistance | 72823.77 | 31143.160 | 41680.615 |
| Agriculture, Forestry, Fishing, and Hunting | 36664.12 | 13908.333 | 22755.783 |
| Mining | 72252.29 | 60531.800 | 11720.490 |
| Utilities | 83860.18 | 56718.750 | 27141.429 |
| Construction | 33772.28 | 31040.217 | 2732.064 |
| Manufacturing | 60809.46 | 32504.409 | 28305.055 |
| Wholesale Trade | 68140.04 | 41498.517 | 26641.519 |
| Retail Trade | 47518.28 | 19137.980 | 28380.300 |
| Transportation and Warehousing | 53170.00 | 40951.835 | 12218.161 |
| Information | 85020.79 | 37219.183 | 47801.609 |
| Finance and Insurance | 134255.17 | 44203.949 | 90051.224 |
| Real Estate and Rental and Leasing | 40185.22 | 43157.958 | -2972.736 |
| Professional, Scientific, and Technical Services | 115337.94 | 48800.584 | 66537.361 |
| Management, Administrative and Support, and Waste Management Services | 29635.60 | 21206.439 | 8429.164 |
| Educational Services | 55653.86 | 36623.458 | 19030.397 |
| Arts, Entertainment, and Recreation | 36357.63 | 26473.541 | 9884.088 |
| Accomodations and Food Services | 35760.42 | 14137.303 | 21623.113 |
| Other Services (Except Public Administration | 41605.68 | 18320.415 | 23285.260 |
| Public Administration and Active Duty Military | 64865.78 | 43747.191 | 21118.594 |
| Armed Forces | 41500.00 | 47333.333 | -5833.333 |
| Not in Labor Force | NaN | 0.000 | NaN |
| Uncodeable | 33815.37 | 41270.625 | -7455.257 |
| NA | 15467.76 | 5957.862 | 9509.898 |
# Reorder industries by wage gap
wage.gap.ind <- mutate(wage.gap.ind, industry = reorder(industry, income.gap))
# Graph wage gap by industry
ggplot(data = wage.gap.ind, aes(x = industry, y = income.gap, fill = I("steelblue"))) +
geom_bar(stat = "identity") +
xlab("Industry") +
ylab("Income gap ($)") +
ggtitle("Income gap between men \nand women, by industry") +
coord_flip(xlim = NULL, ylim = NULL) +
guides(fill = "none")
## Warning: Removed 1 rows containing missing values (position_stack).
The bar chart above shows that the wage gap between men and women is far more pronounced in certain industries than in others, with most of these disparities favoring men. The widest disparities in earnings are in Finance and Insurance; Professional, Scientific, and Technical Services; and Information, while the smallest disparities are in Construction, Real Estate and Rental and Leasing, and the Armed Forces, with the latter two categories favoring women. In other words, in the industries in which women have an advantage, that advantage tends to be very modest, whereas the advantage is much larger in those industries that favor men. This analysis provides further evidence that occupational choice is likely not driving the difference we observe in the wages of men and women, although it is an important qualifier, for the reasons just mentioned.
Next, let’s look at the relationship between income and educational attainment (highest_grade).
# Educational attainment tables and graphs
# Analyze mean income by educational attainment
# Boxplot of income by educational attainment
plot_ly(nlsy, x = ~highest_grade, y = ~income, color = ~highest_grade, type = "box")
## Warning: Ignoring 5662 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
# Create nlsy.eduinc
nlsy.eduinc <- nlsy %>%
group_by(gender, highest_grade) %>%
summarize(count = n(), income = mean(income, na.rm = TRUE))
## Warning: Factor `highest_grade` contains implicit NA, consider using
## `forcats::fct_explicit_na`
| gender | highest_grade | count | income |
|---|---|---|---|
| Male | 12th grade | 1644 | 35593.851 |
| Male | 3rd grade | 2 | 34000.000 |
| Male | 4th grade | 4 | 15175.000 |
| Male | 5th grade | 3 | 18666.667 |
| Male | 6th grade | 11 | 21181.818 |
| Male | 7th grade | 17 | 10381.250 |
| Male | 8th grade | 62 | 15947.492 |
| Male | 9th grade | 98 | 21481.411 |
| Male | 10th grade | 83 | 15726.840 |
| Male | 11th grade | 111 | 17978.963 |
| Male | 1st year college | 271 | 50073.200 |
| Male | 2nd year college | 323 | 52006.051 |
| Male | 3rd year college | 149 | 60522.845 |
| Male | 4th year college | 410 | 99372.714 |
| Male | 5th year college | 81 | 87994.613 |
| Male | 6th year college | 119 | 126561.248 |
| Male | 7th year college | 46 | 124460.444 |
| Male | 8th year college or more | 90 | 165950.531 |
| Male | NA | 2879 | NaN |
| Female | 12th grade | 1534 | 20820.893 |
| Female | None | 2 | 0.000 |
| Female | 3rd grade | 7 | 9257.143 |
| Female | 4th grade | 2 | 19000.000 |
| Female | 5th grade | 2 | 0.000 |
| Female | 6th grade | 21 | 4263.810 |
| Female | 7th grade | 22 | 5409.091 |
| Female | 8th grade | 40 | 4034.595 |
| Female | 9th grade | 68 | 7421.875 |
| Female | 10th grade | 73 | 6521.145 |
| Female | 11th grade | 78 | 10127.158 |
| Female | 1st year college | 367 | 27772.804 |
| Female | 2nd year college | 428 | 30467.777 |
| Female | 3rd year college | 232 | 28718.323 |
| Female | 4th year college | 459 | 47055.389 |
| Female | 5th year college | 126 | 47855.312 |
| Female | 6th year college | 173 | 53895.641 |
| Female | 7th year college | 77 | 70402.479 |
| Female | 8th year college or more | 66 | 72784.015 |
| Female | NA | 2506 | NaN |
# Bar chart
nlsy.eduinc.obj <- ggplot(data = nlsy.eduinc, aes(y = income, x = highest_grade, fill = gender)) +
theme(legend.title = element_blank())
nlsy.eduinc.obj + geom_bar(stat = "identity", position = position_dodge(preserve = "single")) +
xlab("Highest grade completed") +
ylab("Average income") +
ggtitle("Average income by educational attainment, by gender") +
theme(axis.text.x = element_text(angle=60, hjust=1))
## Warning: Removed 2 rows containing missing values (geom_bar).
In the boxplot provided above, you can see the generally positive effects of additional years of education on average income earned. The boxplot also allows us to see how additional years of education influence the range of incomes that become accessible to people in each class. Many of the higher income categories (e.g., above $100k/year) are reserved almost exclusively for those possessing at least a high school diploma (i.e., having completed at least 12 years of education). Around the $350k mark, you can see the top-coded values of those earning significantly more than the bulk of the distribution for each class. We can therefore assume that the real average for these higher educational levels is actually somewhat higher than what is displayed, although such outcomes are rare.
The bar chart similarly shows a positive correlation between level of educational attainment and average income. Average income tends to increase with every additional level of educational attainment, for both men and women. There are a few minor deviations from this trend among levels of grade school as well as college, but some of these differences likely fall within the margin of error for those measurements, and so should not be interpreted as significant. Between the major classes of educational attainment, e.g., from grade school to a bachelor’s degree, and between different levels of higher education, the difference is much more pronounced.
Notably, the positive effect of educational attainment on income is much more pronounced for men than for women, a pattern that holds across virtually every category of education. The sole exception is for those with a 4th grade education, though again, this difference is likely within the margin of error for this category (n = 6), and therefore should not be interpreted as significant.
# Calculate proportional representation of genders per level of educational attainment
prop.edu <- prop.table(table(nlsy$highest_grade, nlsy$gender), margin = 2)
prop.edu.per <- round(prop.edu* 100, 2)
kable(prop.edu.per)
| | Male | Female |
|---|---|---|
| 12th grade | 46.65 | 40.61 |
| None | 0.00 | 0.05 |
| 3rd grade | 0.06 | 0.19 |
| 4th grade | 0.11 | 0.05 |
| 5th grade | 0.09 | 0.05 |
| 6th grade | 0.31 | 0.56 |
| 7th grade | 0.48 | 0.58 |
| 8th grade | 1.76 | 1.06 |
| 9th grade | 2.78 | 1.80 |
| 10th grade | 2.36 | 1.93 |
| 11th grade | 3.15 | 2.07 |
| 1st year college | 7.69 | 9.72 |
| 2nd year college | 9.17 | 11.33 |
| 3rd year college | 4.23 | 6.14 |
| 4th year college | 11.63 | 12.15 |
| 5th year college | 2.30 | 3.34 |
| 6th year college | 3.38 | 4.58 |
| 7th year college | 1.31 | 2.04 |
| 8th year college or more | 2.55 | 1.75 |
# Graph proportional representation of genders per level of educational attainment
prop.edu.df <- as.data.frame(prop.edu)
prop.edu.obj1 <- ggplot(data = prop.edu.df, aes(y = Freq, x = Var1, fill = Var2)) +
theme(legend.title = element_blank())
prop.edu.obj1 + geom_bar(stat = "identity", position = 'dodge') +
xlab("Highest grade completed") +
ylab("Proportion") +
ggtitle("Proportional representation per level of educational \nattainment, by gender") +
theme(axis.text.x = element_text(angle=60, hjust=1))
# Graph difference in proportions
prop.edu.per.df <- as.data.frame(prop.edu.per)
prop.edu.diff <- prop.edu.per.df %>%
group_by(Var1) %>%
summarize(prop.gap = Freq[Var2 == "Male"] - Freq[Var2 == "Female"])
prop.edu.diff
## # A tibble: 19 x 2
## Var1 prop.gap
## <fct> <dbl>
## 1 12th grade 6.04
## 2 None -0.05
## 3 3rd grade -0.13
## 4 4th grade 0.06
## 5 5th grade 0.0400
## 6 6th grade -0.25
## 7 7th grade -0.100
## 8 8th grade 0.7
## 9 9th grade 0.980
## 10 10th grade 0.430
## 11 11th grade 1.08
## 12 1st year college -2.03
## 13 2nd year college -2.16
## 14 3rd year college -1.91
## 15 4th year college -0.520
## 16 5th year college -1.04
## 17 6th year college -1.2
## 18 7th year college -0.73
## 19 8th year college or more 0.800
prop.edu.obj2 <- ggplot(data = prop.edu.diff, aes(y = prop.gap, x = Var1, fill = I("steelblue"))) +
theme(legend.title = element_blank())
prop.edu.obj2 + geom_bar(stat = "identity") +
xlab("Highest grade completed") +
ylab("Percentage difference") +
ggtitle("Proportional representation per level of educational \nattainment, male - female") +
theme(axis.text.x = element_text(angle=60, hjust=1))
The two bar charts above show the proportional representation of men and women at each level of educational attainment and the male - female difference in proportional representation at each level, respectively. While male and female respondents were represented nearly equally across all categories, the slight differences that do exist are telling. Specifically, we find that men are more strongly represented among those who completed up to some high school (9th to 12th grade) and among those with 8 or more years of college, while women are more strongly represented among those who completed some college (1 to 7 years). In other words, female respondents were on average better educated than male respondents across the sample.
This analysis, like the previous one, provides evidence against the first alternative hypothesis, which proposed that women may be earning lower incomes because of lower educational attainment compared to men. In fact, what this analysis shows is that men are earning more despite having lower educational qualifications than their female counterparts, which is precisely the opposite of what this hypothesis predicted.
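The claim that women in the sample are at least as well educated as men can be checked directly from the tables above. The sketch below is illustrative only and its variable names are hypothetical. It sums the percentage of each gender with at least one year of college, then reweights women’s level-specific average incomes by the male education distribution (a simple direct standardization). It treats the rounded table percentages as weights, even though the income means were computed only over respondents with non-missing income, so the result is an approximation.

```r
# Values copied from the education tables above, in table order
edu <- data.frame(
  level = c("12th grade", "None", "3rd grade", "4th grade", "5th grade",
            "6th grade", "7th grade", "8th grade", "9th grade", "10th grade",
            "11th grade", "1st year college", "2nd year college",
            "3rd year college", "4th year college", "5th year college",
            "6th year college", "7th year college", "8th year college or more"),
  college = c(rep(FALSE, 11), rep(TRUE, 8)),
  male_pct = c(46.65, 0.00, 0.06, 0.11, 0.09, 0.31, 0.48, 1.76, 2.78, 2.36,
               3.15, 7.69, 9.17, 4.23, 11.63, 2.30, 3.38, 1.31, 2.55),
  female_pct = c(40.61, 0.05, 0.19, 0.05, 0.05, 0.56, 0.58, 1.06, 1.80, 1.93,
                 2.07, 9.72, 11.33, 6.14, 12.15, 3.34, 4.58, 2.04, 1.75),
  female_income = c(20820.893, 0.000, 9257.143, 19000.000, 0.000, 4263.810,
                    5409.091, 4034.595, 7421.875, 6521.145, 10127.158,
                    27772.804, 30467.777, 28718.323, 47055.389, 47855.312,
                    53895.641, 70402.479, 72784.015)
)

# Share of each gender with at least one year of college (percent)
college_share <- c(male   = sum(edu$male_pct[edu$college]),
                   female = sum(edu$female_pct[edu$college]))
print(round(college_share, 2))

# Female mean income under the female vs. the male education distribution
observed   <- weighted.mean(edu$female_income, edu$female_pct)
reweighted <- weighted.mean(edu$female_income, edu$male_pct)
print(round(c(observed = observed, reweighted = reweighted)))
```

With these table values, 42.26% of men and 51.05% of women report at least one year of college, and giving women the male education distribution lowers their weighted average income rather than raising it. This is the opposite of what an education-based explanation of the gap would predict.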
In the final section of my analysis of the relationship between income and professional qualifications, I compare the wage gap within each level of educational attainment to further validate the results of the previous sections’ analyses.
# Calculate wage gap by level of educational attainment
wage.gap.edu <- nlsy %>%
group_by(highest_grade) %>%
summarize(count = n(), male = mean(income[gender == "Male"], na.rm = TRUE),
female = mean(income[gender == "Female"], na.rm = TRUE),
income.gap = mean(income[gender == "Male"], na.rm = TRUE) -
mean(income[gender == "Female"], na.rm = TRUE))
## Warning: Factor `highest_grade` contains implicit NA, consider using
## `forcats::fct_explicit_na`
| highest_grade | count | male | female | income.gap |
|---|---|---|---|---|
| 12th grade | 3178 | 35593.85 | 20820.893 | 14772.958 |
| None | 2 | NaN | 0.000 | NaN |
| 3rd grade | 9 | 34000.00 | 9257.143 | 24742.857 |
| 4th grade | 6 | 15175.00 | 19000.000 | -3825.000 |
| 5th grade | 5 | 18666.67 | 0.000 | 18666.667 |
| 6th grade | 32 | 21181.82 | 4263.810 | 16918.009 |
| 7th grade | 39 | 10381.25 | 5409.091 | 4972.159 |
| 8th grade | 102 | 15947.49 | 4034.595 | 11912.897 |
| 9th grade | 166 | 21481.41 | 7421.875 | 14059.536 |
| 10th grade | 156 | 15726.84 | 6521.145 | 9205.695 |
| 11th grade | 189 | 17978.96 | 10127.158 | 7851.805 |
| 1st year college | 638 | 50073.20 | 27772.804 | 22300.396 |
| 2nd year college | 751 | 52006.05 | 30467.777 | 21538.275 |
| 3rd year college | 381 | 60522.85 | 28718.323 | 31804.522 |
| 4th year college | 869 | 99372.71 | 47055.389 | 52317.325 |
| 5th year college | 207 | 87994.61 | 47855.312 | 40139.300 |
| 6th year college | 292 | 126561.25 | 53895.641 | 72665.607 |
| 7th year college | 123 | 124460.44 | 70402.479 | 54057.965 |
| 8th year college or more | 156 | 165950.53 | 72784.015 | 93166.515 |
| NA | 5385 | NaN | NaN | NaN |
# Graph wage gap by level of educational attainment
ggplot(data = wage.gap.edu, aes(x = highest_grade, y = income.gap, fill = I("steelblue"))) +
geom_bar(stat = "identity") +
xlab("Highest grade completed") +
ylab("Income gap ($)") +
ggtitle("Income gap between men and women, by level of educational attainment") +
theme(axis.text.x = element_text(angle=60, hjust=1))
## Warning: Removed 2 rows containing missing values (position_stack).
The bar chart above shows that the wage gap between men and women increases as level of educational attainment increases, in favor of men. We see slight drops at irregular intervals, such as 5 years and 7 years of college, which might represent individuals who stopped short of completing a higher-level degree, such as a master’s, doctorate, or professional degree. Alternatively, they may just reflect a small sample size, and therefore a larger margin of error, for these categories.
This analysis provides a first line of evidence that professional qualifications are likely not driving the difference we observe in the wages of men and women, insofar as men are benefiting more on average from the positive relationship between educational attainment and income, despite women having the stronger educational credentials on average. The last factor we’ll consider in our evaluation of the first alternative hypothesis is number of jobs, which is being used here as a proxy for professional experience.
# Number of jobs tables and graphs
# Analyze mean income by number of jobs
nlsy.jobsinc <- nlsy %>%
group_by(gender, jobs_number) %>%
summarize(count = n(), income = mean(income, na.rm = TRUE))
# Kable
kable(nlsy.jobsinc)
| gender | jobs_number | count | income |
|---|---|---|---|
| Male | 0 | 6 | 0.000 |
| Male | 1 | 29 | 40194.138 |
| Male | 2 | 64 | 38122.949 |
| Male | 3 | 80 | 47883.633 |
| Male | 4 | 135 | 61320.128 |
| Male | 5 | 185 | 71927.517 |
| Male | 6 | 183 | 62999.920 |
| Male | 7 | 190 | 63784.467 |
| Male | 8 | 207 | 59528.240 |
| Male | 9 | 228 | 65235.624 |
| Male | 10 | 229 | 61215.014 |
| Male | 11 | 213 | 65989.446 |
| Male | 12 | 193 | 57901.611 |
| Male | 13 | 194 | 51173.299 |
| Male | 14 | 170 | 57591.124 |
| Male | 15 | 120 | 54903.276 |
| Male | 16 | 157 | 55533.842 |
| Male | 17 | 134 | 45091.077 |
| Male | 18 | 115 | 42553.054 |
| Male | 19 | 93 | 52805.000 |
| Male | 20 | 104 | 32300.767 |
| Male | 21 | 74 | 31556.740 |
| Male | 22 | 65 | 35930.969 |
| Male | 23 | 44 | 35292.744 |
| Male | 24 | 48 | 29415.375 |
| Male | 25 | 35 | 35546.057 |
| Male | 26 | 37 | 19235.000 |
| Male | 27 | 27 | 37642.222 |
| Male | 28 | 27 | 44917.560 |
| Male | 29 | 26 | 31633.923 |
| Male | 30 | 21 | 26405.056 |
| Male | 31 | 17 | 33667.750 |
| Male | 32 | 12 | 16510.917 |
| Male | 33 | 9 | 23966.667 |
| Male | 34 | 15 | 27831.467 |
| Male | 35 | 7 | 17585.714 |
| Male | 36 | 6 | 17666.667 |
| Male | 37 | 2 | 31500.000 |
| Male | 38 | 6 | 17166.667 |
| Male | 39 | 3 | 17933.333 |
| Male | 40 | 3 | 5066.667 |
| Male | 41 | 3 | 0.000 |
| Male | 42 | 2 | 46660.000 |
| Male | 45 | 2 | 32500.000 |
| Male | 46 | 1 | 0.000 |
| Male | 48 | 1 | 53000.000 |
| Male | 51 | 1 | 0.000 |
| Male | 52 | 1 | 0.000 |
| Male | NA | 2879 | NaN |
| Female | 0 | 24 | 0.000 |
| Female | 1 | 47 | 8500.000 |
| Female | 2 | 90 | 23874.831 |
| Female | 3 | 119 | 21838.922 |
| Female | 4 | 170 | 28158.497 |
| Female | 5 | 202 | 26475.857 |
| Female | 6 | 240 | 33109.662 |
| Female | 7 | 256 | 27601.757 |
| Female | 8 | 281 | 29636.307 |
| Female | 9 | 234 | 35015.523 |
| Female | 10 | 234 | 32428.815 |
| Female | 11 | 220 | 30439.252 |
| Female | 12 | 233 | 30574.814 |
| Female | 13 | 177 | 29223.560 |
| Female | 14 | 184 | 37431.529 |
| Female | 15 | 181 | 32352.938 |
| Female | 16 | 140 | 30429.701 |
| Female | 17 | 105 | 28461.265 |
| Female | 18 | 114 | 30736.964 |
| Female | 19 | 91 | 31286.793 |
| Female | 20 | 74 | 25816.958 |
| Female | 21 | 51 | 25668.438 |
| Female | 22 | 57 | 24111.255 |
| Female | 23 | 44 | 28983.568 |
| Female | 24 | 37 | 40409.611 |
| Female | 25 | 28 | 16593.192 |
| Female | 26 | 29 | 28499.750 |
| Female | 27 | 22 | 41228.227 |
| Female | 28 | 19 | 28161.111 |
| Female | 29 | 15 | 31640.000 |
| Female | 30 | 15 | 22476.333 |
| Female | 31 | 7 | 39428.571 |
| Female | 32 | 6 | 9823.667 |
| Female | 33 | 4 | 6050.000 |
| Female | 34 | 3 | 21000.000 |
| Female | 35 | 5 | 20164.000 |
| Female | 36 | 2 | 11680.000 |
| Female | 37 | 5 | 15179.000 |
| Female | 38 | 3 | 20000.000 |
| Female | 41 | 4 | 34750.000 |
| Female | 44 | 1 | 55000.000 |
| Female | 45 | 1 | 0.000 |
| Female | 47 | 2 | 30194.000 |
| Female | 58 | 1 | 50000.000 |
| Female | NA | 2506 | NaN |
# Bar chart
nlsy.jobsinc.obj <- ggplot(data = nlsy.jobsinc, aes(y = income, x = jobs_number, fill = gender)) +
theme(legend.title = element_blank())
nlsy.jobsinc.obj + geom_bar(stat = "identity", position = position_dodge(preserve = "single")) +
xlab("Number of jobs") +
ylab("Average income ($)") +
ggtitle("Average income per number of jobs, by gender")
## Warning: Removed 2 rows containing missing values (geom_bar).
The bar chart above shows that men again earn higher incomes on average across most of the range of number of jobs. The precise nature of the relationship between number of jobs and income is not especially clear from this graph, except perhaps for a few exceptional individuals at the extreme high end of the distribution, who appear to benefit from having held more jobs. What we may be seeing here is simply the effect of age, with number of jobs held serving as a proxy for a person’s age rather than necessarily their experience. For respondents who reported holding between 20 and 40 jobs, however, the effect on income appears rather erratic, and possibly even negative. For individuals who have held a number of jobs close to the sample average, income appears to be more or less stable, suggesting that the effect of this variable on income may be minimal.
# Calculate proportional representation of genders per number of jobs
prop.jobs <- prop.table(table(nlsy$jobs_number, nlsy$gender), margin = 2)
prop.jobs.per <- round(prop.jobs * 100, 2)
kable(prop.jobs.per)
| | Male | Female |
|---|---|---|
| 0 | 0.17 | 0.64 |
| 1 | 0.82 | 1.24 |
| 2 | 1.82 | 2.38 |
| 3 | 2.27 | 3.15 |
| 4 | 3.83 | 4.50 |
| 5 | 5.25 | 5.35 |
| 6 | 5.19 | 6.35 |
| 7 | 5.39 | 6.78 |
| 8 | 5.87 | 7.44 |
| 9 | 6.47 | 6.20 |
| 10 | 6.50 | 6.20 |
| 11 | 6.04 | 5.82 |
| 12 | 5.48 | 6.17 |
| 13 | 5.51 | 4.69 |
| 14 | 4.82 | 4.87 |
| 15 | 3.41 | 4.79 |
| 16 | 4.46 | 3.71 |
| 17 | 3.80 | 2.78 |
| 18 | 3.26 | 3.02 |
| 19 | 2.64 | 2.41 |
| 20 | 2.95 | 1.96 |
| 21 | 2.10 | 1.35 |
| 22 | 1.84 | 1.51 |
| 23 | 1.25 | 1.16 |
| 24 | 1.36 | 0.98 |
| 25 | 0.99 | 0.74 |
| 26 | 1.05 | 0.77 |
| 27 | 0.77 | 0.58 |
| 28 | 0.77 | 0.50 |
| 29 | 0.74 | 0.40 |
| 30 | 0.60 | 0.40 |
| 31 | 0.48 | 0.19 |
| 32 | 0.34 | 0.16 |
| 33 | 0.26 | 0.11 |
| 34 | 0.43 | 0.08 |
| 35 | 0.20 | 0.13 |
| 36 | 0.17 | 0.05 |
| 37 | 0.06 | 0.13 |
| 38 | 0.17 | 0.08 |
| 39 | 0.09 | 0.00 |
| 40 | 0.09 | 0.00 |
| 41 | 0.09 | 0.11 |
| 42 | 0.06 | 0.00 |
| 44 | 0.00 | 0.03 |
| 45 | 0.06 | 0.03 |
| 46 | 0.03 | 0.00 |
| 47 | 0.00 | 0.05 |
| 48 | 0.03 | 0.00 |
| 51 | 0.03 | 0.00 |
| 52 | 0.03 | 0.00 |
| 58 | 0.00 | 0.03 |
# Graph proportional representation of genders per number of jobs
prop.jobs.df <- as.data.frame(prop.jobs)
prop.jobs.obj1 <- ggplot(data = prop.jobs.df, aes(y = Freq, x = Var1, fill = Var2)) +
theme(legend.title = element_blank())
prop.jobs.obj1 + geom_bar(stat = "identity", position = position_dodge(preserve = "single")) +
xlab("Number of jobs") +
ylab("Proportion") +
ggtitle("Proportional representation per number of jobs, by gender") +
theme(axis.text.x = element_text(angle=90, hjust=1))
# Graph difference in proportions
prop.jobs.per.df <- as.data.frame(prop.jobs.per)
prop.jobs.diff <- prop.jobs.per.df %>%
group_by(Var1) %>%
summarize(prop.gap = Freq[Var2 == "Male"] - Freq[Var2 == "Female"])
prop.jobs.diff
## # A tibble: 51 x 2
## Var1 prop.gap
## <fct> <dbl>
## 1 0 -0.47
## 2 1 -0.42
## 3 2 -0.560
## 4 3 -0.880
## 5 4 -0.67
## 6 5 -0.1000
## 7 6 -1.16
## 8 7 -1.39
## 9 8 -1.57
## 10 9 0.270
## # ... with 41 more rows
prop.jobs.obj2 <- ggplot(data = prop.jobs.diff, aes(y = prop.gap, x = Var1, fill = I("steelblue"))) +
theme(legend.title = element_blank())
prop.jobs.obj2 + geom_bar(stat = "identity") +
xlab("Number of jobs") +
ylab("Percentage difference") +
ggtitle("Proportional representation per number of jobs, male - female") +
coord_flip(xlim = NULL, ylim = NULL)
The first bar chart above shows a nearly identical distribution of men and women across the range of number of jobs held, suggesting approximate balance between the genders with respect to this variable. Since neither gender has substantially higher representation in any category along this range, number of jobs is unlikely to explain any difference in income between the genders, regardless of the nature of its relationship to income.
The second bar chart shows a slight disparity in the representation of the genders across number of jobs. Because the gap is computed as male minus female, the negative values for 0 to 8 jobs indicate that women have slightly higher representation (by less than 2 percentage points) in the lower range of jobs held, while men have slightly higher representation (by about 1 percentage point or less) across most of the higher range (roughly 16 jobs and above). If number of jobs is read as a proxy for professional experience, this would suggest that men are somewhat more strongly represented among the most experienced categories of workers. The differences are small, however, so this offers at most weak support for the first alternative hypothesis.
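Because the sign of the gap depends on the order of subtraction, it can help to check the convention on a small example. The sketch below uses made-up counts (not NLSY79 values) and hypothetical object names (`counts`, `shares`, `prop.gap`), but follows the same `prop.table(..., margin = 2)` approach used above.

```r
# Hypothetical counts: rows = number of jobs, columns = gender.
counts <- matrix(c(10, 30,
                   50, 50,
                   40, 20),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(jobs_number = c("1", "2", "3"),
                                 gender = c("Male", "Female")))
counts <- as.table(counts)

# margin = 2 normalises within each column, so each gender's shares sum to 100.
shares <- round(prop.table(counts, margin = 2) * 100, 2)

# Male minus female, in percentage points: a negative value means the
# category holds a larger share of women than of men.
prop.gap <- shares[, "Male"] - shares[, "Female"]

print(shares)
print(prop.gap)
```

In this toy table the first category has a negative gap (a larger share of women) and the last has a positive gap (a larger share of men), which is the reading to apply to the flipped bar chart.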
# Calculate wage gap by number of jobs
wage.gap.jobs <- nlsy %>%
group_by(jobs_number) %>%
summarize(count = n(), male = mean(income[gender == "Male"], na.rm = TRUE),
female = mean(income[gender == "Female"], na.rm = TRUE),
income.gap = mean(income[gender == "Male"], na.rm = TRUE) -
mean(income[gender == "Female"], na.rm = TRUE))
# Kable
kable(wage.gap.jobs)
| jobs_number | count | male | female | income.gap |
|---|---|---|---|---|
| 0 | 30 | 0.000 | 0.000 | 0.000000 |
| 1 | 76 | 40194.138 | 8500.000 | 31694.137931 |
| 2 | 154 | 38122.949 | 23874.831 | 14248.117827 |
| 3 | 199 | 47883.633 | 21838.922 | 26044.711172 |
| 4 | 305 | 61320.128 | 28158.497 | 33161.630887 |
| 5 | 387 | 71927.517 | 26475.857 | 45451.659903 |
| 6 | 423 | 62999.920 | 33109.662 | 29890.258117 |
| 7 | 446 | 63784.467 | 27601.757 | 36182.710190 |
| 8 | 488 | 59528.240 | 29636.307 | 29891.933182 |
| 9 | 462 | 65235.624 | 35015.523 | 30220.100917 |
| 10 | 463 | 61215.014 | 32428.815 | 28786.198658 |
| 11 | 433 | 65989.446 | 30439.252 | 35550.193697 |
| 12 | 426 | 57901.611 | 30574.814 | 27326.796047 |
| 13 | 371 | 51173.299 | 29223.560 | 21949.739465 |
| 14 | 354 | 57591.124 | 37431.529 | 20159.595488 |
| 15 | 301 | 54903.276 | 32352.938 | 22550.338362 |
| 16 | 297 | 55533.842 | 30429.701 | 25104.140613 |
| 17 | 239 | 45091.077 | 28461.265 | 16629.811617 |
| 18 | 229 | 42553.054 | 30736.964 | 11816.090418 |
| 19 | 184 | 52805.000 | 31286.793 | 21518.206897 |
| 20 | 178 | 32300.767 | 25816.958 | 6483.809244 |
| 21 | 125 | 31556.740 | 25668.438 | 5888.302226 |
| 22 | 122 | 35930.969 | 24111.255 | 11819.714205 |
| 23 | 88 | 35292.744 | 28983.568 | 6309.176004 |
| 24 | 85 | 29415.375 | 40409.611 | -10994.236111 |
| 25 | 63 | 35546.057 | 16593.192 | 18952.864835 |
| 26 | 66 | 19235.000 | 28499.750 | -9264.750000 |
| 27 | 49 | 37642.222 | 41228.227 | -3586.005050 |
| 28 | 46 | 44917.560 | 28161.111 | 16756.448889 |
| 29 | 41 | 31633.923 | 31640.000 | -6.076923 |
| 30 | 36 | 26405.056 | 22476.333 | 3928.722222 |
| 31 | 24 | 33667.750 | 39428.571 | -5760.821429 |
| 32 | 18 | 16510.917 | 9823.667 | 6687.250000 |
| 33 | 13 | 23966.667 | 6050.000 | 17916.666667 |
| 34 | 18 | 27831.467 | 21000.000 | 6831.466667 |
| 35 | 12 | 17585.714 | 20164.000 | -2578.285714 |
| 36 | 8 | 17666.667 | 11680.000 | 5986.666667 |
| 37 | 7 | 31500.000 | 15179.000 | 16321.000000 |
| 38 | 9 | 17166.667 | 20000.000 | -2833.333333 |
| 39 | 3 | 17933.333 | NaN | NaN |
| 40 | 3 | 5066.667 | NaN | NaN |
| 41 | 7 | 0.000 | 34750.000 | -34750.000000 |
| 42 | 2 | 46660.000 | NaN | NaN |
| 44 | 1 | NaN | 55000.000 | NaN |
| 45 | 3 | 32500.000 | 0.000 | 32500.000000 |
| 46 | 1 | 0.000 | NaN | NaN |
| 47 | 2 | NaN | 30194.000 | NaN |
| 48 | 1 | 53000.000 | NaN | NaN |
| 51 | 1 | 0.000 | NaN | NaN |
| 52 | 1 | 0.000 | NaN | NaN |
| 58 | 1 | NaN | 50000.000 | NaN |
| NA | 5385 | NaN | NaN | NaN |
# Graph wage gap by number of jobs
ggplot(data = wage.gap.jobs, aes(x = jobs_number, y = income.gap, fill = I("steelblue"))) +
geom_bar(stat = "identity") +
xlab("Number of jobs") +
ylab("Income gap ($)") +
ggtitle("Income gap between men and women, by number of jobs")
## Warning: Removed 11 rows containing missing values (position_stack).
The bar chart above shows a generally persistent trend of men earning higher incomes regardless of number of jobs. The wage gap is largest near the center of the distribution (roughly 4 to 11 jobs, peaking at 5 jobs), suggesting that number of jobs might not be a particularly significant factor in determining income level. There are large disparities at the extreme high end of the distribution, likely representing a small number of exceptional cases. Something interesting may be happening here, but whatever it is, it likely won’t generalize to the broader population.
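One way to keep a handful of exceptional cases from dominating the picture is to drop categories in which either gender has too few respondents before computing the gap. The sketch below is only an illustration: the data frame `toy` and the threshold `min.n` are hypothetical, and the incomes are made up rather than drawn from the NLSY79 data.

```r
# Hypothetical toy data standing in for nlsy (values are made up).
toy <- data.frame(
  jobs_number = c(5, 5, 5, 5, 5, 5, 41, 41),
  gender = c("Male", "Male", "Male", "Female", "Female", "Female",
             "Male", "Female"),
  income = c(60000, 70000, 80000, 30000, 35000, 40000, 0, 34750)
)
min.n <- 3  # assumed minimum cell size; choose to suit the data

# Number of respondents with non-missing income per jobs_number x gender cell.
ok <- !is.na(toy$income)
cell.n <- tapply(ok, list(toy$jobs_number, toy$gender), sum)

# Mean income per cell.
cell.mean <- tapply(toy$income, list(toy$jobs_number, toy$gender),
                    mean, na.rm = TRUE)

# Keep only categories where both genders meet the threshold.
keep <- cell.n[, "Male"] >= min.n & cell.n[, "Female"] >= min.n
gap <- cell.mean[keep, "Male"] - cell.mean[keep, "Female"]
print(gap)
```

Here the category with a single man and a single woman is excluded, so only the well-populated category contributes a gap. The choice of threshold is a judgment call and should be reported alongside the results.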
All in all, our analysis of this variable has not been particularly informative. Consequently, it will be excluded from my final findings.
Having examined the relationship between income and professional qualifications and occupational choices, we’ll now move on to evaluate our second alternative hypothesis, i.e., that the wage gap between men and women is a result of family dynamics. To test this hypothesis, we’ll look at three variables that, taken together, represent the most intuitive sources of possible influence on an individual’s professional decisions and outcomes: marital status, family size, and spouse’s income.
# Alternative Hypothesis 2 tables and graphs - family dynamics
# Marital Status tables and graphs
# Boxplot of income by marital status
plot_ly(nlsy, x = ~marital_status, y = ~income, color = ~marital_status, type = "box")
## Warning: Ignoring 6177 observations
# Violin plot of income by marital status
qplot(x = marital_status, y = income, data = nlsy, geom = "violin", fill = marital_status)
## Warning: Removed 5662 rows containing non-finite values (stat_ydensity).
# Analyze mean income by marital status
nlsy.marinc <- nlsy %>%
group_by(gender, marital_status) %>%
summarize(count = n(), income = mean(income, na.rm = TRUE))
## Warning: Factor `marital_status` contains implicit NA, consider using
## `forcats::fct_explicit_na`
| gender | marital_status | count | income |
|---|---|---|---|
| Male | Never married | 889 | 31129.53 |
| Male | Married | 2271 | 69747.32 |
| Male | Separated | 195 | 29194.01 |
| Male | Divorced | 545 | 40017.00 |
| Male | Widowed | 17 | 30776.92 |
| Male | NA | 2486 | 35712.16 |
| Female | Never married | 689 | 27591.87 |
| Female | Married | 2395 | 31809.81 |
| Female | Separated | 288 | 22226.29 |
| Female | Divorced | 689 | 29151.98 |
| Female | Widowed | 52 | 23127.38 |
| Female | NA | 2170 | 24445.27 |
# Bar chart
nlsy.marinc.obj <- ggplot(data = nlsy.marinc, aes(y = income, x = marital_status, fill = gender)) +
theme(legend.title = element_blank())
nlsy.marinc.obj + geom_bar(stat = "identity", position = position_dodge(preserve = "single")) +
xlab("Marital Status") +
ylab("Average income ($)") +
ggtitle("Average income per marital status, by gender")
The bar chart above shows the average income per category of marital status, separated by gender. Men earn higher incomes on average than women across all categories of marital status. The difference is most pronounced among married individuals and least pronounced among those who have never been married. The second alternative hypothesis offers a possible explanation for this trend: single individuals, whether male or female, are less likely to be burdened by the responsibilities of parenthood and can therefore devote more energy and attention to their careers, and perhaps even compete for more competitive, higher-paying jobs. In contrast, married individuals are more likely both to have children and to share incomes with their partners. Both of these factors (i.e., having larger families and sharing income with a spouse) would, according to this hypothesis, lead us to expect a decrease in the wages of women relative to men, as couples shift the burdens of parenthood disproportionately onto one partner to allow the other partner to fill the role of “breadwinner.” The analyses that follow will help evaluate whether the data support this explanation.
The boxplot similarly shows higher median incomes for married respondents compared to all other categories, with the upper fence reaching considerably higher than those of the other categories. The violin plot shows how respondents are distributed across the income range: the married, divorced, and widowed categories are the only ones with a somewhat even concentration of respondents in the higher income ranges, while all other categories taper off steeply as income increases.
# Calculate proportional representation of genders per marital status
prop.mar <- prop.table(table(nlsy$marital_status, nlsy$gender), margin = 2)
prop.mar.per <- round(prop.mar * 100, 2)
kable(prop.mar.per)
| | Male | Female |
|---|---|---|
| Never married | 22.70 | 16.75 |
| Married | 57.98 | 58.23 |
| Separated | 4.98 | 7.00 |
| Divorced | 13.91 | 16.75 |
| Widowed | 0.43 | 1.26 |
# Graph proportional representation of genders per marital status
prop.mar.df <- as.data.frame(prop.mar)
prop.mar.obj1 <- ggplot(data = prop.mar.df, aes(y = Freq, x = Var1, fill = Var2)) +
theme(legend.title = element_blank())
prop.mar.obj1 + geom_bar(stat = "identity", position = 'dodge') +
xlab("Marital Status") +
ylab("Proportion") +
ggtitle("Proportional representation per marital status, by gender")
# Graph difference in proportions
prop.mar.per.df <- as.data.frame(prop.mar.per)
prop.mar.diff <- prop.mar.per.df %>%
group_by(Var1) %>%
summarize(prop.gap = Freq[Var2 == "Male"] - Freq[Var2 == "Female"])
prop.mar.diff
## # A tibble: 5 x 2
## Var1 prop.gap
## <fct> <dbl>
## 1 Never married 5.95
## 2 Married -0.25
## 3 Separated -2.02
## 4 Divorced -2.84
## 5 Widowed -0.83
prop.mar.obj2 <- ggplot(data = prop.mar.diff, aes(y = prop.gap, x = Var1, fill = I("steelblue"))) +
theme(legend.title = element_blank())
prop.mar.obj2 + geom_bar(stat = "identity") +
xlab("Marital Status") +
ylab("Percentage difference") +
ggtitle("Proportional representation per marital status, male - female")
The bar chart above shows that men are more strongly represented among respondents who have never been married, while women predominate in every other category, albeit by only a slight margin (less than 3 percentage points). Our second alternative hypothesis proposed that the wage gap observed between men and women might be partially explained by a larger proportion of women being married relative to men. This analysis provides only very weak evidence for that hypothesis: approximately 6 percentage points more of the male respondents were single (never married), and women were only very slightly (about 0.25 percentage points) more likely to be married. Likewise, a larger proportion of female respondents than male respondents were either separated or divorced. To the extent that these categories correlate positively with shared incomes and/or having children, our hypothesis would receive somewhat stronger support.
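As a supplementary check that is not part of the analysis above, a chi-squared test of independence is one way to ask whether the distribution across marital status categories differs between the genders by more than chance. The sketch below rebuilds the contingency table from the counts reported in the marital status summary table (respondents with missing marital status are left out). The object names `mar.counts`, `mar.shares`, and `mar.test` are hypothetical.

```r
# Counts per marital status and gender, copied from the summary table above.
mar.counts <- matrix(c(889, 689,
                       2271, 2395,
                       195, 288,
                       545, 689,
                       17, 52),
                     ncol = 2, byrow = TRUE,
                     dimnames = list(
                       marital_status = c("Never married", "Married",
                                          "Separated", "Divorced", "Widowed"),
                       gender = c("Male", "Female")))

# Column percentages; these should reproduce the proportional
# representation table shown earlier.
mar.shares <- round(prop.table(mar.counts, margin = 2) * 100, 2)
print(mar.shares)

# Pearson chi-squared test of independence (marital status x gender).
mar.test <- chisq.test(mar.counts)
print(mar.test)
```

With samples this large, even modest differences in representation can produce a small p-value, so the test result is best read alongside the size of the percentage-point gaps rather than in place of them.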
# Graph wage gap by Marital Status
wage.gap.marstat <- nlsy %>%
group_by(marital_status) %>%
summarize(income.gap = mean(income[gender == "Male"], na.rm = TRUE) -
mean(income[gender == "Female"], na.rm = TRUE),
# conf.int[1] is the lower bound and conf.int[2] the upper bound
lower = t.test(income ~ gender)$conf.int[1],
upper = t.test(income ~ gender)$conf.int[2],
is.significant = as.numeric(t.test(income ~ gender)$p.value < 0.05))
## Warning: Factor `marital_status` contains implicit NA, consider using
## `forcats::fct_explicit_na`
# Re-order the marital status factor according to gap size
wage.gap.marstat <- mutate(wage.gap.marstat, marital_status = reorder(marital_status, income.gap))
# Plot, with error bars
ggplot(data = wage.gap.marstat, aes(x = marital_status, y = income.gap, fill = is.significant)) +
geom_bar(stat = "identity") +
xlab("Marital status") +
ylab("Income gap ($)") +
ggtitle("Income gap between men and women, by marital status") +
guides(fill = FALSE) +
geom_errorbar(aes(ymax = upper, ymin = lower), width = 0.1, size = 1) +
theme(text = element_text(size=12))
## # A tibble: 6 x 5
## marital_status income.gap lower upper is.significant
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Never married 3538. -1294. 8370. 0
## 2 Married 37938. 34068. 41807. 1
## 3 Separated 6968. 106. 13830. 1
## 4 Divorced 10865. 4702. 17028. 1
## 5 Widowed 7650. -10473. 25772. 0
## 6 <NA> 11267. 3720. 18814. 1
The bar chart above shows that there is a statistically significant difference in the incomes of men and women for the separated, divorced, and married categories, but no statistically significant difference for the never married and widowed categories. The most notable difference by a wide margin is for the married category, where the mean difference in income between men and women is approximately $37,938, in favor of men.
However, since men and women are represented roughly equally within this category (refer to previous section’s analysis), this disparity does not help to explain why men tend to earn higher incomes than women.
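The significance flags discussed above come from two-sample t-tests run within each marital status category. As a minimal sketch on simulated data (the incomes below are made up, and `toy`, `tt`, `gap`, `lower`, and `upper` are hypothetical names), the following shows how the pieces of a `t.test` result map onto the gap, its confidence bounds, and the significance flag.

```r
set.seed(42)
# Simulated incomes for illustration only; not NLSY79 values.
toy <- data.frame(
  gender = factor(rep(c("Male", "Female"), each = 200),
                  levels = c("Male", "Female")),
  income = c(rnorm(200, mean = 55000, sd = 15000),
             rnorm(200, mean = 45000, sd = 15000))
)

# Welch two-sample t-test; the difference is first level minus second level,
# i.e. mean(Male) - mean(Female) given the factor levels set above.
tt <- t.test(income ~ gender, data = toy)

gap <- unname(tt$estimate[1] - tt$estimate[2])
lower <- tt$conf.int[1]  # lower bound is stored first
upper <- tt$conf.int[2]  # upper bound is stored second
is.significant <- tt$p.value < 0.05

print(c(gap = gap, lower = lower, upper = upper))
print(is.significant)
```

At the 5% level, the gap is flagged as significant exactly when the 95% confidence interval excludes zero, which is why the error bars and the fill colour in the chart convey the same information.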
# Family size tables and graphs
# analyze mean income by family size
nlsy.faminc <- nlsy %>%
group_by(gender, family_size) %>%
summarize(count = n(), income = mean(income, na.rm = TRUE))
# Kable
kable(nlsy.faminc)
| gender | family_size | count | income |
|---|---|---|---|
| Male | 1 | 985 | 35378.04 |
| Male | 2 | 985 | 49467.98 |
| Male | 3 | 687 | 58505.22 |
| Male | 4 | 548 | 77328.13 |
| Male | 5 | 230 | 76211.10 |
| Male | 6 | 57 | 65517.04 |
| Male | 7 | 19 | 23473.68 |
| Male | 8 | 9 | 21111.11 |
| Male | 9 | 2 | 17500.00 |
| Male | 10 | 1 | 0.00 |
| Male | 11 | 1 | 114000.00 |
| Male | NA | 2879 | NaN |
| Female | 1 | 771 | 28023.20 |
| Female | 2 | 1235 | 30690.20 |
| Female | 3 | 853 | 31567.35 |
| Female | 4 | 545 | 29494.64 |
| Female | 5 | 232 | 26978.25 |
| Female | 6 | 86 | 21318.51 |
| Female | 7 | 32 | 15062.50 |
| Female | 8 | 10 | 30720.00 |
| Female | 9 | 5 | 14800.00 |
| Female | 10 | 2 | 66170.50 |
| Female | 11 | 2 | 3500.00 |
| Female | 12 | 3 | 0.00 |
| Female | 16 | 1 | 27000.00 |
| Female | NA | 2506 | NaN |
# Bar chart
nlsy.faminc.obj <- ggplot(data = nlsy.faminc, aes(y = income, x = family_size, fill = gender)) +
theme(legend.title = element_blank())
nlsy.faminc.obj + geom_bar(stat = "identity", position = position_dodge(preserve = "single")) +
xlab("Family size") +
ylab("Average income ($)") +
ggtitle("Average income per family size, by gender")
## Warning: Removed 2 rows containing missing values (geom_bar).
The bar chart above shows a positive relationship between income and family size for men up to a family size of approximately 4, while for women no such relationship exists. Rather, women’s income appears to be roughly flat up to a family size of 4 and then, like men’s, begins to drop. The two large bars at the higher end of the distribution for family size represent just a few outliers and likely do not generalize to the broader population.
Across almost all categories of family size, we again observe an income advantage for men. This trend is consistent with our second alternative hypothesis, which proposed that as couples decide to start families, the two partners adopt a division-of-labor strategy that allows men to continue advancing in their careers through the family-building process, while women’s careers stagnate. This is not the only interpretation of what we see here, but it is one possible interpretation.
# Calculate proportional representation of genders per family size
prop.fam <- prop.table(table(nlsy$family_size, nlsy$gender), margin = 2)
prop.fam.per <- round(prop.fam * 100, 2)
kable(prop.fam.per)
| | Male | Female |
|---|---|---|
| 1 | 27.95 | 20.41 |
| 2 | 27.95 | 32.70 |
| 3 | 19.49 | 22.58 |
| 4 | 15.55 | 14.43 |
| 5 | 6.53 | 6.14 |
| 6 | 1.62 | 2.28 |
| 7 | 0.54 | 0.85 |
| 8 | 0.26 | 0.26 |
| 9 | 0.06 | 0.13 |
| 10 | 0.03 | 0.05 |
| 11 | 0.03 | 0.05 |
| 12 | 0.00 | 0.08 |
| 16 | 0.00 | 0.03 |
# Graph proportional representation of genders per family size
prop.fam.df <- as.data.frame(prop.fam)
prop.fam.obj1 <- ggplot(data = prop.fam.df, aes(y = Freq, x = Var1, fill = Var2)) +
theme(legend.title = element_blank())
prop.fam.obj1 + geom_bar(stat = "identity", position = 'dodge') +
xlab("Family size") +
ylab("Proportion") +
ggtitle("Proportional representation per family size, by gender")
# Graph difference in proportions
prop.fam.per.df <- as.data.frame(prop.fam.per)
prop.fam.diff <- prop.fam.per.df %>%
group_by(Var1) %>%
summarize(prop.gap = Freq[Var2 == "Male"] - Freq[Var2 == "Female"])
prop.fam.diff
## # A tibble: 13 x 2
## Var1 prop.gap
## <fct> <dbl>
## 1 1 7.54
## 2 2 -4.75
## 3 3 -3.09
## 4 4 1.12
## 5 5 0.39
## 6 6 -0.660
## 7 7 -0.310
## 8 8 0
## 9 9 -0.07
## 10 10 -0.02
## 11 11 -0.02
## 12 12 -0.08
## 13 16 -0.03
prop.fam.obj2 <- ggplot(data = prop.fam.diff, aes(y = prop.gap, x = Var1, fill = I("steelblue"))) +
theme(legend.title = element_blank())
prop.fam.obj2 + geom_bar(stat = "identity") +
xlab("Family size") +
ylab("Percentage difference") +
ggtitle("Proportional representation per family size, male - female")
The bar chart above shows that men are more strongly represented at the lowest end of the distribution for family size (family size = 1), while women are more strongly represented for family sizes of 2-3. In the higher ends of the distribution (family size >= 4), the proportional representation of men and women is roughly balanced, with the difference vacillating slightly between men and women up to a family size of 7, and then effectively flattening out for family sizes larger than 7.
This pattern is consistent with our second alternative hypothesis, insofar as it suggests that more men in our study had no or very small families than did women. We might expect having smaller families to provide an advantage to men in terms of income earning potential since they are able to focus more of their attention on advancing their careers.
None of the analyses of this section provide decisive evidence in favor of our hypothesis, and in general, the evidence it does provide is pretty modest. In the next section, we’ll look at the wage gap between men and women for the different categories of family size to try to gain a bit more clarity on the magnitude of the advantage that family size might provide in terms of an individual’s income earning potential.
# Calculate wage gap by level of family size
wage.gap.fam <- nlsy %>%
group_by(family_size) %>%
summarize(count = n(), male = mean(income[gender == "Male"], na.rm = TRUE),
female = mean(income[gender == "Female"], na.rm = TRUE),
income.gap = mean(income[gender == "Male"], na.rm = TRUE) -
mean(income[gender == "Female"], na.rm = TRUE))
# Kable
kable(wage.gap.fam)
| family_size | count | male | female | income.gap |
|---|---|---|---|---|
| 1 | 1756 | 35378.04 | 28023.20 | 7354.846 |
| 2 | 2220 | 49467.98 | 30690.20 | 18777.782 |
| 3 | 1540 | 58505.22 | 31567.35 | 26937.878 |
| 4 | 1093 | 77328.13 | 29494.64 | 47833.484 |
| 5 | 462 | 76211.10 | 26978.25 | 49232.842 |
| 6 | 143 | 65517.04 | 21318.51 | 44198.531 |
| 7 | 51 | 23473.68 | 15062.50 | 8411.184 |
| 8 | 19 | 21111.11 | 30720.00 | -9608.889 |
| 9 | 7 | 17500.00 | 14800.00 | 2700.000 |
| 10 | 3 | 0.00 | 66170.50 | -66170.500 |
| 11 | 3 | 114000.00 | 3500.00 | 110500.000 |
| 12 | 3 | NaN | 0.00 | NaN |
| 16 | 1 | NaN | 27000.00 | NaN |
| NA | 5385 | NaN | NaN | NaN |
# Graph wage gap by family size
ggplot(data = wage.gap.fam, aes(x = family_size, y = income.gap, fill = I("steelblue"))) +
geom_bar(stat = "identity") +
xlab("Family size") +
ylab("Income gap ($)") +
ggtitle("Income gap between men and women, by size of family")
## Warning: Removed 3 rows containing missing values (position_stack).
The bar chart above shows more clearly the magnitude of the wage gap between men and women across different categories of family size. As previously observed, men earn higher incomes on average for all family sizes up to a family size of 7. The high bars at the extreme upper end of the distribution for family size represent outliers and likely do not reflect trends that generalize to the broader population.
Notably, the wage gap between men and women shrinks substantially for larger family sizes (family sizes between 7 and 9). This pattern may simply be an artifact of the smaller sample sizes for these categories (and therefore larger margins of error), or it may indicate that having a family of this size suppresses wages for both men and women equally, although this explanation seems somewhat unlikely. A more plausible explanation may be that larger families tend to correlate positively with age and professional experience, factors that are associated with higher wages for both men and women.
Referring back to the table, we see that sample sizes are indeed much smaller for these categories of family size (increasing our margin of error) and that wages on average are lower for both men and women. This is most consistent with the first two explanations above and largely rules out the third.
# Spouse's income tables and charts
# Analyze mean income by spouse's income
nlsy.spouseinc <- nlsy %>%
group_by(gender) %>%
summarize(count = n(), income = mean(income, na.rm = TRUE), spouse_income = mean(spouse_income, na.rm = TRUE))
# Kable
kable(nlsy.spouseinc)
| gender | count | income | spouse_income |
|---|---|---|---|
| Male | 6403 | 53445.91 | 31076.38 |
| Female | 6283 | 29538.51 | 56380.51 |
# Scatter plot
nlsy.spouseinc.obj <- ggplot(data = nlsy, aes(x = spouse_income, y = income, colour = gender)) +
xlab("Spouse's income ($)") +
ylab("Income ($)") +
ggtitle("Income per gender, by spouse's income") +
theme(legend.position = "none")
nlsy.spouseinc.obj + geom_point() + facet_grid(. ~ gender) + stat_smooth(colour = "black")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 9192 rows containing non-finite values (stat_smooth).
## Warning: Removed 9192 rows containing missing values (geom_point).
The scatter plots above show men’s and women’s income plotted against their spouses’ income. When a smoothed curve is added to represent how income varies with spouse’s income, we see markedly different trends for men and women. For men, personal income appears to be positively correlated with spouse’s income at the higher ranges of the distribution for spouse’s income, while for women, the relationship is almost flat throughout the full range of the distribution. In other words, women’s income stays about the same on average regardless of how much their spouses earn, while men’s income appears to drop slightly as their spouse’s income increases up to about $30,000, but then increases steadily as their spouse’s income rises above $30,000.
This analysis does not provide evidence for our second alternative hypothesis, which proposed that women’s income may be lower than men’s in part because they are strategically dividing the caretaking and “breadwinning” responsibilities with their spouses. If that were happening to a significant extent, we would expect women’s wages to decrease slightly on average as their spouse’s income increases, as some women drop out of the workforce to focus on parenting. Instead, what we find is that women’s incomes tend to increase along with their spouse’s income up to about $150,000 and decline only when their spouse’s income exceeds $150,000.
This interpretation is slightly more consistent with what we see happening in the case of men, however, at least for lower-income couples. That is, men tend to have lower incomes as their spouse’s income increases up to about $40,000, possibly reflecting the effects of strategic income sharing to balance household responsibilities (although this is certainly not the only explanation for what we see). Among higher-income couples, however, both partners’ incomes seem to increase together.
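Thresholds read off a smoothed curve are approximate, so it can be useful to confirm them numerically by averaging income within bands of spouse’s income. The sketch below illustrates the idea on made-up data: the data frame `toy`, the band edges, and the band labels are all assumptions chosen for illustration, not values taken from the NLSY79 data.

```r
# Hypothetical toy data; not NLSY79 values.
toy <- data.frame(
  gender = c("Male", "Male", "Male", "Male",
             "Female", "Female", "Female", "Female"),
  spouse_income = c(10000, 20000, 60000, 90000,
                    10000, 20000, 60000, 90000),
  income = c(50000, 44000, 70000, 90000,
             30000, 31000, 29000, 30000)
)

# Assumed band edges; include.lowest = TRUE keeps a spouse income of 0.
toy$band <- cut(toy$spouse_income,
                breaks = c(0, 30000, 150000),
                labels = c("<=30k", "30k-150k"),
                include.lowest = TRUE)

# Mean income per gender within each band of spouse's income.
band.means <- tapply(toy$income, list(toy$gender, toy$band),
                     mean, na.rm = TRUE)
print(band.means)
```

Applied to the real data with finer bands, a table like this would show directly whether mean income rises, falls, or stays flat from one band of spouse’s income to the next for each gender.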
# Run regression
nlsy.lm <- lm(income ~ gender + industry + highest_grade + marital_status + spouse_income, data = nlsy)
# Output summary
summary(nlsy.lm)
##
## Call:
## lm(formula = income ~ gender + industry + highest_grade + marital_status +
## spouse_income, data = nlsy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -174383 -27993 -5721 15250 332390
##
## Coefficients:
## Estimate
## (Intercept) 50408.63332
## genderFemale -38549.57669
## industryAgriculture, Forestry, Fishing, and Hunting -17573.60033
## industryMining 5533.53835
## industryUtilities 16315.67927
## industryConstruction -12613.21141
## industryManufacturing 7542.68076
## industryWholesale Trade 16255.21939
## industryRetail Trade -8288.20265
## industryTransportation and Warehousing 1899.73611
## industryInformation -8530.26504
## industryFinance and Insurance 22446.96423
## industryReal Estate and Rental and Leasing 127.48170
## industryProfessional, Scientific, and Technical Services 14724.82624
## industryManagement, Administrative and Support, and Waste Management Services -13202.61740
## industryEducational Services -20881.38251
## industryArts, Entertainment, and Recreation -26188.22396
## industryAccomodations and Food Services -4452.80819
## industryOther Services (Except Public Administration -17688.01379
## industryPublic Administration and Active Duty Military -3262.75178
## industryArmed Forces -35431.90704
## industryUncodeable 703.97973
## highest_gradeNone -21140.18910
## highest_grade3rd grade -19725.39031
## highest_grade4th grade -38391.20419
## highest_grade5th grade -28408.94459
## highest_grade6th grade -22375.67412
## highest_grade7th grade -21874.45298
## highest_grade8th grade -19361.64800
## highest_grade9th grade -15572.74528
## highest_grade10th grade -15232.66764
## highest_grade11th grade -8061.94008
## highest_grade1st year college 8905.06696
## highest_grade2nd year college 11454.94653
## highest_grade3rd year college 15517.66234
## highest_grade4th year college 41690.83978
## highest_grade5th year college 41089.50464
## highest_grade6th year college 66760.02367
## highest_grade7th year college 69880.27751
## highest_grade8th year college or more 98193.07651
## marital_statusMarried 9007.20721
## marital_statusSeparated -2014.48258
## marital_statusDivorced 2599.34044
## marital_statusWidowed -1344.96715
## spouse_income 0.01957
## Std. Error
## (Intercept) 5767.07279
## genderFemale 2327.78847
## industryAgriculture, Forestry, Fishing, and Hunting 9476.83840
## industryMining 12557.53168
## industryUtilities 9547.29445
## industryConstruction 5117.06663
## industryManufacturing 4162.62515
## industryWholesale Trade 6678.70338
## industryRetail Trade 4505.12703
## industryTransportation and Warehousing 5439.48091
## industryInformation 7101.72268
## industryFinance and Insurance 5657.77316
## industryReal Estate and Rental and Leasing 8654.90972
## industryProfessional, Scientific, and Technical Services 5172.94580
## industryManagement, Administrative and Support, and Waste Management Services 5682.71128
## industryEducational Services 4226.14491
## industryArts, Entertainment, and Recreation 10187.84149
## industryAccomodations and Food Services 6406.99666
## industryOther Services (Except Public Administration 5661.71730
## industryPublic Administration and Active Duty Military 4726.67509
## industryArmed Forces 20311.33298
## industryUncodeable 20298.34744
## highest_gradeNone 56854.98364
## highest_grade3rd grade 28587.54679
## highest_grade4th grade 33086.74481
## highest_grade5th grade 56871.00256
## highest_grade6th grade 19062.35620
## highest_grade7th grade 19077.69985
## highest_grade8th grade 11792.94330
## highest_grade9th grade 9288.12509
## highest_grade10th grade 9540.19675
## highest_grade11th grade 8594.04315
## highest_grade1st year college 3795.95729
## highest_grade2nd year college 3575.15019
## highest_grade3rd year college 4756.14441
## highest_grade4th year college 3160.68107
## highest_grade5th year college 5648.47115
## highest_grade6th year college 4776.04189
## highest_grade7th year college 7129.01326
## highest_grade8th year college or more 6256.67862
## marital_statusMarried 4762.31574
## marital_statusSeparated 8109.54455
## marital_statusDivorced 6008.68773
## marital_statusWidowed 14997.38040
## spouse_income 0.02328
## t value
## (Intercept) 8.741
## genderFemale -16.561
## industryAgriculture, Forestry, Fishing, and Hunting -1.854
## industryMining 0.441
## industryUtilities 1.709
## industryConstruction -2.465
## industryManufacturing 1.812
## industryWholesale Trade 2.434
## industryRetail Trade -1.840
## industryTransportation and Warehousing 0.349
## industryInformation -1.201
## industryFinance and Insurance 3.967
## industryReal Estate and Rental and Leasing 0.015
## industryProfessional, Scientific, and Technical Services 2.847
## industryManagement, Administrative and Support, and Waste Management Services -2.323
## industryEducational Services -4.941
## industryArts, Entertainment, and Recreation -2.571
## industryAccomodations and Food Services -0.695
## industryOther Services (Except Public Administration -3.124
## industryPublic Administration and Active Duty Military -0.690
## industryArmed Forces -1.744
## industryUncodeable 0.035
## highest_gradeNone -0.372
## highest_grade3rd grade -0.690
## highest_grade4th grade -1.160
## highest_grade5th grade -0.500
## highest_grade6th grade -1.174
## highest_grade7th grade -1.147
## highest_grade8th grade -1.642
## highest_grade9th grade -1.677
## highest_grade10th grade -1.597
## highest_grade11th grade -0.938
## highest_grade1st year college 2.346
## highest_grade2nd year college 3.204
## highest_grade3rd year college 3.263
## highest_grade4th year college 13.190
## highest_grade5th year college 7.274
## highest_grade6th year college 13.978
## highest_grade7th year college 9.802
## highest_grade8th year college or more 15.694
## marital_statusMarried 1.891
## marital_statusSeparated -0.248
## marital_statusDivorced 0.433
## marital_statusWidowed -0.090
## spouse_income 0.841
## Pr(>|t|)
## (Intercept) < 2e-16
## genderFemale < 2e-16
## industryAgriculture, Forestry, Fishing, and Hunting 0.06378
## industryMining 0.65949
## industryUtilities 0.08756
## industryConstruction 0.01376
## industryManufacturing 0.07008
## industryWholesale Trade 0.01499
## industryRetail Trade 0.06590
## industryTransportation and Warehousing 0.72693
## industryInformation 0.22978
## industryFinance and Insurance 7.43e-05
## industryReal Estate and Rental and Leasing 0.98825
## industryProfessional, Scientific, and Technical Services 0.00445
## industryManagement, Administrative and Support, and Waste Management Services 0.02023
## industryEducational Services 8.19e-07
## industryArts, Entertainment, and Recreation 0.01020
## industryAccomodations and Food Services 0.48711
## industryOther Services (Except Public Administration 0.00180
## industryPublic Administration and Active Duty Military 0.49007
## industryArmed Forces 0.08118
## industryUncodeable 0.97234
## highest_gradeNone 0.71005
## highest_grade3rd grade 0.49025
## highest_grade4th grade 0.24601
## highest_grade5th grade 0.61744
## highest_grade6th grade 0.24056
## highest_grade7th grade 0.25164
## highest_grade8th grade 0.10073
## highest_grade9th grade 0.09372
## highest_grade10th grade 0.11044
## highest_grade11th grade 0.34827
## highest_grade1st year college 0.01904
## highest_grade2nd year college 0.00137
## highest_grade3rd year college 0.00112
## highest_grade4th year college < 2e-16
## highest_grade5th year college 4.40e-13
## highest_grade6th year college < 2e-16
## highest_grade7th year college < 2e-16
## highest_grade8th year college or more < 2e-16
## marital_statusMarried 0.05867
## marital_statusSeparated 0.80383
## marital_statusDivorced 0.66534
## marital_statusWidowed 0.92855
## spouse_income 0.40063
##
## (Intercept) ***
## genderFemale ***
## industryAgriculture, Forestry, Fishing, and Hunting .
## industryMining
## industryUtilities .
## industryConstruction *
## industryManufacturing .
## industryWholesale Trade *
## industryRetail Trade .
## industryTransportation and Warehousing
## industryInformation
## industryFinance and Insurance ***
## industryReal Estate and Rental and Leasing
## industryProfessional, Scientific, and Technical Services **
## industryManagement, Administrative and Support, and Waste Management Services *
## industryEducational Services ***
## industryArts, Entertainment, and Recreation *
## industryAccomodations and Food Services
## industryOther Services (Except Public Administration **
## industryPublic Administration and Active Duty Military
## industryArmed Forces .
## industryUncodeable
## highest_gradeNone
## highest_grade3rd grade
## highest_grade4th grade
## highest_grade5th grade
## highest_grade6th grade
## highest_grade7th grade
## highest_grade8th grade
## highest_grade9th grade .
## highest_grade10th grade
## highest_grade11th grade
## highest_grade1st year college *
## highest_grade2nd year college **
## highest_grade3rd year college **
## highest_grade4th year college ***
## highest_grade5th year college ***
## highest_grade6th year college ***
## highest_grade7th year college ***
## highest_grade8th year college or more ***
## marital_statusMarried .
## marital_statusSeparated
## marital_statusDivorced
## marital_statusWidowed
## spouse_income
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56760 on 3067 degrees of freedom
## (9574 observations deleted due to missingness)
## Multiple R-squared: 0.2822, Adjusted R-squared: 0.2719
## F-statistic: 27.4 on 44 and 3067 DF, p-value: < 2.2e-16
# Print the names of the predictors whose coefficient estimates are statistically significant at the 0.05 level
sig.coef <- names(summary(nlsy.lm)$coef[summary(nlsy.lm)$coef[,4] <= .05, 4])
In this section, I fit a linear regression model of income on gender and my chosen key variables and interpret the model coefficients. As noted above, several variables included in my part 1 analysis, including jobs_number and family_size, were excluded from my final model due to weak association with the main variables of interest (gender and income) and/or difficulties with interpretability. It should also be noted that the analysis in this section relies on certain assumptions that will not be evaluated until the next section (part (b)), which will determine whether a linear regression is appropriate for modeling the relationship between income and these variables, and correspondingly, whether the standard interpretation of the coefficients is valid.
The first thing to note from the output summary above is that gender is a highly statistically significant predictor of income, with a p-value of < 2e-16. Even holding industry, educational attainment, marital status, and spouse’s income constant, being female is associated with a -$38,549.58 difference in income compared to being male. Altogether, the statistically significant coefficient estimates in this model include (Intercept), genderFemale, industryConstruction, industryWholesale Trade, industryFinance and Insurance, industryProfessional, Scientific, and Technical Services, industryManagement, Administrative and Support, and Waste Management Services, industryEducational Services, industryArts, Entertainment, and Recreation, industryOther Services (Except Public Administration, highest_grade1st year college, highest_grade2nd year college, highest_grade3rd year college, highest_grade4th year college, highest_grade5th year college, highest_grade6th year college, highest_grade7th year college, highest_grade8th year college or more. Below I provide an interpretation of a select few of these significant variables.
For the interpretations that follow, the baseline for comparison is a male who has never been married, has a high school education (has completed 12 years of education), works in the area of health care and social assistance, and has a spouse with an income of $0. This is, of course, merely a hypothetical scenario and doesn’t necessarily (or actually) represent any individual from our sample population. For ease of interpretation, all subsequent mentions of “holding all other variables constant” should be understood to connote this particular collection of features, save only for the facts that (a) the individual being compared against this baseline is female (and therefore starts from an expected income $38,549.58 lower than the male baseline) and (b) differs in the one additional respect specified (i.e., that for which the coefficient is being interpreted).
Referring to the output summary above, we see that for a woman working in Educational Services, the model predicts a -$59,430.96 difference in income relative to the male baseline, holding all other variables constant. This figure is the sum of the genderFemale coefficient (-38,549.58) and the Educational Services coefficient (-20,881.38). By the same calculation, a woman working in Finance and Insurance is associated with a -$16,102.61 difference in income relative to the male baseline, and a woman working in Professional, Scientific, and Technical Services with a -$23,824.75 difference.
Similarly, with regard to level of educational attainment, we see that a woman who has completed 4 years of college is associated with a $3,141.26 difference in income relative to the male baseline, holding all other variables constant, while a woman who has completed 8 or more years of college is associated with a $59,643.50 difference.
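The combined figures quoted above can be reproduced by adding the genderFemale estimate to each category estimate. The sketch below copies the estimates from the printed summary rather than reading them from the fitted model; the short vector names are labels chosen for this illustration, not the model’s own coefficient names.

```r
# Coefficient estimates copied by hand from the summary(nlsy.lm) output above.
# The names are shorthand labels for this example only.
coefs <- c(
  genderFemale          = -38549.57669,
  educational_services  = -20881.38251,
  finance_insurance     =  22446.96423,
  professional_services =  14724.82624,
  college_4th_year      =  41690.83978,
  college_8th_year_plus =  98193.07651
)

# Predicted difference for a woman in each category relative to the male
# baseline: gender coefficient plus the category coefficient.
combined <- coefs[["genderFemale"]] + coefs[-1]
print(round(combined, 2))
```

Because the model has no interaction terms, the gender coefficient simply shifts every category by the same amount.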
The R-squared for this model is 0.28, meaning that approximately 28.22% of the variability in income is explained by knowing the values of our predictors.
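As a rough consistency check, the other summary statistics printed with the model can be recovered from the R-squared, the degrees of freedom, and the residual sum of squares. The sketch below uses the rounded values as printed (the RSS is the figure shown for the full model in the ANOVA tables later in this report), so the results agree with the output only up to rounding.

```r
# Values copied from the printed model output (rounded as printed)
r2       <- 0.2822          # Multiple R-squared
k        <- 44              # number of estimated slopes (numerator DF)
df_resid <- 3067            # residual degrees of freedom
rss      <- 9880161250758   # residual sum of squares of the full model

# Overall F-statistic implied by R-squared
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# Adjusted R-squared: n - 1 = df_resid + k
adj_r2 <- 1 - (1 - r2) * (df_resid + k) / df_resid

# Residual standard error
rse <- sqrt(rss / df_resid)

print(c(F = f_stat, adj_r2 = adj_r2, rse = rse))
```

These come out at approximately 27.4, 0.2719, and 56,760, matching the last three lines of the summary output.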
Next, I turn to an evaluation of the linear regression model to better assess the validity of the interpretations provided in this section.
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
In this section, I discuss whether the standard diagnostic plots indicate issues with a linear regression model for gender and my chosen key variables. The issues I’m looking for include such things as trends in residuals, variance issues, outliers, etc.
First of all, when we plot the distribution of residuals for this relationship, we see that both the linearity assumption (i.e., that the residuals look like random scatter around the zero line and there is no evidence of structure or pattern in the residuals) and the homoscedasticity assumption (i.e., that there is equal variance in the deviation of each y value from the fitted line) are violated, meaning the relationship cannot be appropriately modeled by a linear regression. Instead, the data have a tendency to concentrate below where the fitted line would predict them to appear, indicating that the mean of the observed values is consistently lower than what we would expect if the relationship were linear.
The assumption of homoscedasticity is somewhat more difficult to assess, but there appear to be slight variations in the deviation of y values at certain x values, specifically below x values of about 25,000. There also seems to be significant left-to-right fanning when we focus just on where the data are most densely concentrated, although this fanning is clearly constrained by the rigid bottom limit, and to a somewhat lesser degree, the upper limit as well (likely a consequence of our topcoded values). In light of this analysis, we can conclude that any linear regression model of this relationship is going to be significantly limited in its predictive power.
The Scale-Location plot provides a more fine-tuned tool for assessing the assumption of homoscedasticity, i.e., equal variance in the deviation of each y value from the fitted line. If this assumption were upheld, the red line running through the points would be approximately flat in the horizontal direction. However, that’s not what we see in our plot, indicating that we do not have equal variance in the deviation of our y values from the fitted line across all values of x. This analysis further supports our conclusion from above that a linear model of this relationship is not appropriate.
Next, we consider the Q-Q plot. Q-Q plots take the sample data, sort them in ascending order, and then plot them against quantiles calculated from a theoretical distribution (https://data.library.virginia.edu/understanding-q-q-plots/). The superimposed line represents where the data would be expected to fall if their underlying distribution were normal. The Q-Q plot above shows that the sample data do not conform to a normal distribution, as indicated by the sharp deviation of the observed values above a theoretical quantile of about 1.5. There is also slight skewing at the lower end of the distribution, although this appears to be within an acceptable range.
Finally, the Residuals vs. Leverage plot allows us to identify influential data points in our model. The points we’re most concerned about are values in the upper right or lower right corners, which are outside the red dashed Cook’s distance line. These are points that would be influential in the model, possibly distorting our estimates. If such points were present, we’d want to consider removing them in order to get more accurate estimates from our model (https://medium.com/data-distilled/residual-plots-part-4-residuals-vs-leverage-plot-14aeed009ef7). It appears as though there may be one point (labeled 2681) that lies either on or just outside this boundary, suggesting our model might be improved by removing it. Because it’s on the cusp, however, I’ve decided to leave it.
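The four plots discussed above are the default output of plot() applied to an lm object. The sketch below reproduces them on simulated data (not the NLSY data) in which the noise grows with x, so the fanning pattern described above is easy to see. The variable names are illustrative, and the 0.5 cutoff on Cook’s distance is an assumption chosen to match the inner dashed contour that plot() draws by default.

```r
set.seed(1)

# Simulated data with non-constant variance: the spread of y grows with x
n <- 200
x <- runif(n, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 + 0.5 * x)
fit <- lm(y ~ x)

# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Cook's distance for every observation; flag any beyond the 0.5 contour
cd <- cooks.distance(fit)
influential <- which(cd > 0.5)
print(influential)
```

Listing the flagged row indices in this way gives a numerical counterpart to reading a borderline point off the Residuals vs. Leverage plot.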
In this section, I use the anova() function to assess whether each of the variables included in my model is a statistically significant predictor of income by comparing my model from part (a) to a model where each of these variables is excluded.
# Use the anova() function to assess whether industry is a statistically significant predictor of income
anova(nlsy.lm, update(nlsy.lm, . ~ . - industry, data = na.omit(nlsy[, all.vars(formula(nlsy.lm))])))
## Analysis of Variance Table
##
## Model 1: income ~ gender + industry + highest_grade + marital_status +
## spouse_income
## Model 2: income ~ gender + highest_grade + marital_status + spouse_income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3067 9880161250758
## 2 3087 10337872794679 -20 -457711543921 7.1041 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Use the anova() function to assess whether highest_grade is a statistically significant predictor of income
anova(nlsy.lm, update(nlsy.lm, . ~ . - highest_grade))
## Analysis of Variance Table
##
## Model 1: income ~ gender + industry + highest_grade + marital_status +
## spouse_income
## Model 2: income ~ gender + industry + marital_status + spouse_income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3067 9880161250758
## 2 3085 11647196744774 -18 -1767035494015 30.474 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Use the anova() function to assess whether marital_status is a statistically significant predictor of income
anova(nlsy.lm, update(nlsy.lm, . ~ . - marital_status, data = na.omit(nlsy[, all.vars(formula(nlsy.lm))])))
## Analysis of Variance Table
##
## Model 1: income ~ gender + industry + highest_grade + marital_status +
## spouse_income
## Model 2: income ~ gender + industry + highest_grade + spouse_income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3067 9880161250758
## 2 3071 9907054362983 -4 -26893112225 2.087 0.0799 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Use the anova() function to assess whether spouse's income is a statistically significant predictor of income
anova(nlsy.lm, update(nlsy.lm, . ~ . - spouse_income, data = na.omit(nlsy[, all.vars(formula(nlsy.lm))])))
## Analysis of Variance Table
##
## Model 1: income ~ gender + industry + highest_grade + marital_status +
## spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3067 9880161250758
## 2 3068 9882437635679 -1 -2276384921 0.7066 0.4006
First, I apply the anova() function to the industry variable. From this analysis, we see that industry is highly statistically significant, with a p-value of 6.8831159e-20. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that, holding the other predictors constant, average income is the same across all industries. In other words, the data suggest that income varies with the industry in which one works. Note that this test concerns the main effect of industry on income; by itself it does not tell us whether the income gap between men and women varies by industry.
When I apply the anova() function to the highest_grade variable, we see that it, too, is highly statistically significant, with a p-value of 5.905078e-96. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that, holding the other predictors constant, average income is the same across all levels of educational attainment. In other words, the data suggest that income varies with level of education.
When I apply the anova() function to the marital_status variable, however, we find that it is not statistically significant at the 0.05 level, having a p-value of 0.0799031. If a linear regression were appropriate for modeling this relationship, we would therefore not be able to reject the null hypothesis that, holding the other predictors constant, average income is the same across all marital statuses. In other words, we do not have evidence that income varies with a person’s marital status once the other predictors are accounted for.
Finally, when I apply the anova() function to the spouse_income variable, we see that it, too, is not statistically significant, having a p-value of 0.4006285. If a linear regression were appropriate for modeling this relationship, we would not be able to reject the null hypothesis that, holding the other predictors constant, income does not vary with spouse’s income. In other words, the data do not indicate that income varies with spouse’s income once the other predictors are accounted for.
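The F statistics and p-values in the four ANOVA tables above follow directly from the residual sums of squares (RSS) and residual degrees of freedom of the full and reduced models. The sketch below recomputes them from the printed values as a check; the data frame and its column names are constructed for this illustration only.

```r
# Full model, as printed in row 1 of each ANOVA table
rss_full <- 9880161250758
df_full  <- 3067

# Reduced models, as printed in row 2 of each ANOVA table
reduced <- data.frame(
  dropped = c("industry", "highest_grade", "marital_status", "spouse_income"),
  rss     = c(10337872794679, 11647196744774, 9907054362983, 9882437635679),
  df      = c(3087, 3085, 3071, 3068)
)

# Number of coefficients removed in each reduced model
q <- reduced$df - df_full

# F = (increase in RSS per dropped coefficient) / (full-model residual variance)
reduced$F <- ((reduced$rss - rss_full) / q) / (rss_full / df_full)

# Upper-tail probability of the F distribution with (q, df_full) DF
reduced$p_value <- pf(reduced$F, q, df_full, lower.tail = FALSE)

print(reduced[, c("dropped", "F", "p_value")])
```

The recomputed F values (about 7.10, 30.47, 2.09, and 0.71) agree with the printed tables.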
# Update the linear regression model from part (a) to also include an interaction term between industry and gender
nlsy.lm.interact1 <- update(nlsy.lm, . ~ . + industry * gender)
# Update the linear regression model from part (a) to also include an interaction term between highest_grade and gender
nlsy.lm.interact2 <- update(nlsy.lm, . ~ . + highest_grade * gender)
# Update the linear regression model from part (a) to also include an interaction term between marital_status and gender
nlsy.lm.interact3 <- update(nlsy.lm, . ~ . + marital_status * gender)
# Use the anova() function to assess whether interaction term 1 is a statistically significant predictor of income
anova(nlsy.lm, nlsy.lm.interact1)
## Analysis of Variance Table
##
## Model 1: income ~ gender + industry + highest_grade + marital_status +
## spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status +
## spouse_income + gender:industry
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3067 9880161250758
## 2 3047 9760960253620 20 119200997138 1.8605 0.0114 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Use the anova() function to assess whether interaction term 2 is a statistically significant predictor of income
anova(nlsy.lm, nlsy.lm.interact2)
## Analysis of Variance Table
##
## Model 1: income ~ gender + industry + highest_grade + marital_status +
## spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status +
## spouse_income + gender:highest_grade
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3067 9880161250758
## 2 3052 9618402780284 15 261758470474 5.5372 2.852e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Use the anova() function to assess whether interaction term 3 is a statistically significant predictor of income
anova(nlsy.lm, nlsy.lm.interact3)
## Analysis of Variance Table
##
## Model 1: income ~ gender + industry + highest_grade + marital_status +
## spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status +
## spouse_income + gender:marital_status
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3067 9880161250758
## 2 3063 9848920220902 4 31241029856 2.429 0.04573 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this study, we are exploring the question of whether there are any factors that exacerbate or mitigate the income gap between men and women. This is different from asking whether there are factors that affect income. In the latter case, we are only estimating the so-called main effects of each variable, while in the former case, we are measuring the interaction effects, i.e., the effect that emerges when two variables appear together. The specific interaction effects that we’re interested in are those combining gender and one of our other chosen variables. By looking at the p-values for the interaction terms, we can answer the question of whether the income gap differs across different levels of these key variables (http://www.andrew.cmu.edu/user/achoulde/94842/lectures/lecture11/lecture11-94842.html).
The first interaction term (industry * gender) is statistically significant, with a p-value of 0.0114029. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that the income gap is the same across all industries. In other words, the data suggest that the income gap between men and women varies with industry.
The second interaction term (highest_grade * gender) is also statistically significant, with a p-value of 2.8523434e-11. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that the income gap is the same across all levels of educational attainment. In other words, the data suggest that the income gap between men and women varies with educational attainment.
The third interaction term (marital_status * gender) is also statistically significant, with a p-value of 0.0457257. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that the income gap is the same across all marital statuses. In other words, the data suggest that the income gap between men and women varies with marital status.
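To make the distinction between main effects and interaction effects concrete, the sketch below simulates data (not the NLSY data) in which the gap between two groups is built to be 5 units at one level of a second factor and 15 units at the other. All names and numbers here are assumptions made for illustration. Comparing the model with and without the group:level term mirrors the anova() comparisons above.

```r
set.seed(42)

# Balanced simulated design: two groups crossed with two levels
n <- 400
group <- factor(rep(c("A", "B"), each = n / 2))
level <- factor(rep(c("low", "high"), times = n / 2), levels = c("low", "high"))

# Group B earns 5 less than group A at "low" and 15 less at "high"
gap <- ifelse(level == "high", 15, 5)
y <- 50 + ifelse(group == "B", -gap, 0) + rnorm(n, sd = 5)

# Main-effects model versus model with the interaction term added
fit_main <- lm(y ~ group + level)
fit_int  <- update(fit_main, . ~ . + group:level)

# F test of the interaction: does the gap differ between levels?
comparison <- anova(fit_main, fit_int)
print(comparison)

# Estimated extra gap at "high" relative to "low" (built in as -10)
print(coef(fit_int)["groupB:levelhigh"])
```

A significant main effect of level would only say that income differs between levels; it is the interaction coefficient that measures how much the group gap changes from one level to the other.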
This section details the approach I took in exploring and analyzing the data. Here, I tell the story of how I got to my main conclusions and address some of the twists and turns I encountered along the way.
Negative values were used in the NLSY 1979 Cohort Survey to indicate various non-response scenarios, including refusal to answer (-1), don’t know (-2), invalid skip (-3), valid skip (-4), and non-interview (-5). For the purposes of this analysis, these values were replaced with NA values to preserve the integrity of statistical summaries. This approach, while imperfect, was judged to be less problematic than any of the alternative approaches, such as systematically assigning non-responses to one or more of the other response categories, which would rely on assumptions I didn’t feel adequately equipped to justify.
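A minimal sketch of this recoding step is shown below, using a small made-up data frame rather than the actual NLSY file; the column names and values are hypothetical. Every occurrence of the five non-response codes is replaced with NA so that summaries such as the mean are computed only over valid responses.

```r
# Toy data frame standing in for the survey data (values are invented)
survey <- data.frame(
  income        = c(42000, -1, 0, -4, 58000),
  highest_grade = c(12, 16, -2, 14, -5)
)

# NLSY79 non-response codes: refusal, don't know, invalid skip,
# valid skip, non-interview
nonresponse_codes <- c(-1, -2, -3, -4, -5)

# Replace the codes with NA in every column, keeping the data frame shape
survey[] <- lapply(survey, function(column) {
  column[column %in% nonresponse_codes] <- NA
  column
})

# Summaries now ignore the non-responses when na.rm = TRUE
mean_income <- mean(survey$income, na.rm = TRUE)
print(survey)
print(mean_income)
```

Note that a genuine income of 0 is kept, since only the negative codes are treated as missing.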
Each of these responses has potentially different significance to our analysis depending on the research question. For example, I found that male respondents were significantly more likely than female respondents to refuse to report their level of educational attainment. While it’s possible that individuals responding in this way would have had roughly the same distribution across the various categories as those who did respond, it’s perhaps more likely that the no-response bin contained a greater proportion of low-attainment individuals than the group that provided responses and was included (due, perhaps, to the embarrassment of disclosing low educational attainment or fear of negative outcomes). If this assumption were correct, then a disparity in the numbers of non-responders would be expected to bias the model of the relationship between income and educational attainment upward relative to the true relationship.
Omitted data are more likely to bias our models when they represent a large proportion of the sample responses for a question, and they are more problematic when they are non-random (as assumed for the case above) than when they are random. Such omissions lead to classical and non-classical measurement error, which introduce bias into our models and lead to inaccurate interpretations of model coefficients.
Similarly, omitted data can negatively impact the validity of the resulting analysis by introducing selection bias. As noted above, the decision not to reply to a particular question might indicate some important difference between the responding and non-responding portions of the population (e.g., mistrust of the interviewer, low self-esteem, etc.). Such non-random differences between segments of the sample population would therefore act as a confounder and compromise the internal validity of our study.
In the NLSY79 Survey data, the variable that serves as the outcome for this analysis, TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (TRUNC) (2012 survey question), was topcoded, meaning we do not get to see the actual incomes for the top 2% of earners. Survey data are often topcoded before release to the public to preserve the anonymity of respondents and to prevent possibly-erroneous outliers from being published (see “Top-coded,” Wikipedia). For the purposes of this analysis, I chose to retain the topcoded data, primarily for the purpose of excluding outliers. Even if these data aren’t erroneous (i.e., even if they correspond to actual respondents), the presence of outliers can skew the data and obscure more generalizable patterns and trends.
One worry with this approach is that, by excluding extreme outliers, we actually risk concealing one of the most notable trends in the relationship between gender and income, namely the lack of representation of women in the highest paying (especially executive level) positions. While this is indeed a weakness of the approach, I felt that this might represent a somewhat unique case with a slightly different causal story than the one that applies more generally to the population.
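To illustrate what topcoding does to a variable, the sketch below applies one simple rule to simulated incomes: values above the 98th percentile are replaced by the mean of those values. This rule is an assumption made for illustration; the exact procedure applied to the NLSY79 income variable is not described in this report, and the simulated distribution is invented.

```r
set.seed(7)

# Simulated right-skewed incomes (not NLSY data)
income <- round(rlnorm(1000, meanlog = 10.5, sdlog = 0.8))

# Assumed rule: replace everything above the 98th percentile
# with the mean of those top values
cutoff <- quantile(income, 0.98)
is_top <- income > cutoff
topcoded <- income
topcoded[is_top] <- mean(income[is_top])

# The overall mean is unchanged, but the spread of the top earners is lost
print(c(mean_raw = mean(income), mean_topcoded = mean(topcoded)))
print(c(max_raw = max(income), max_topcoded = max(topcoded)))
```

Under a rule like this, every topcoded respondent carries the same value, which is why differences among the very highest earners cannot be studied with the released data.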
In my evaluation of the second alternative hypothesis, I did several analyses examining the relationship between income and number of jobs held by respondents. I assumed, perhaps mistakenly, that this variable could be used as a proxy for experience and would therefore correlate positively with income. However, the data did not support this assumption, suggesting that the variable may not have measured what I expected it to.
My analysis of the relationship between income and family size also turned out not to be as informative as I expected. In this case, I took family size to be a proxy for the number of children the respondent had, although the metadata did not make this explicit. In any case, I wasn’t able to discern much meaning in my analysis of that relationship, particularly in reference to the hypothesis I enlisted it to test.
Finally, I expected my analysis of the relationship between industry and income to shed more light on the wage gap between men and women than it did. While I was able to extract some interesting insights from that analysis, I felt that other interesting insights were still obscured at that level of generality. In order to get at these insights (or at least to satisfy my curiosity), I would have also liked to include a more granular analysis of the proportional representation of each gender in different professions using the Occupation variable from the base data set.
I investigated a number of relationships that, for various reasons, don’t appear in my findings sections. One of these is the relationship between race and income. I created several tables and plots to examine the effect that race had on the wage gap observed between men and women and found that, indeed, a woman’s race had a substantial impact on how large a difference in income she was likely to experience relative to her male counterparts. While interesting in its own right, I didn’t feel like this analysis shed much light on the central question of this study, which asked whether gender alone influenced how much a person earned in income. The primary utility of the race analysis for my purposes was to rule out the possibility that race was serving as a confounder in the relationship between gender and income, which we would have expected in the case that male and female respondents were unequally distributed among the race categories. This, however, turned out not to be the case, so I decided to exclude the analysis from the report of my findings.
I also ran several analyses looking specifically at high wage earners (individuals earning above $50,000/yr). In the end, I felt this analysis was too narrow in its focus and detracted from the more general trends I was interested in exploring.
Finally, because my analyses of the jobs_number and family_size variables didn’t yield the depth of insight I was hoping for, I decided to exclude them from my findings as well.
For this study, I set out to look for evidence that might refute the null hypothesis that there is no significant difference in income between men and women and that, rather, any difference observed is completely explainable by differences in other factors between these two groups. In my final analysis, I chose to focus on the two alternative hypotheses I felt told the most comprehensive story about why men and women might earn different incomes - that is, if, in fact, those differences weren’t the result of discriminatory practices. In other words, I wanted to explore the possibility that the differences observed were due to factors other than the person’s gender, and to examine these factors in the broader context in which I would expect to find them.
The first of these alternative hypotheses is that men and women earn different incomes on average because of their professional qualifications and/or occupational choices. According to this account, men and women will tend to earn different incomes on account of one gender having higher qualifications in terms of either higher educational attainment or more professional experience (or both), which is rewarded with higher-paying positions, and/or on account of one gender choosing to work in industries with lower-than-average prospective salaries. All else held equal, I would expect both of these mechanisms to work in the same direction resulting in a larger wage gap between the genders than if only one of them were operative, or if they favored the genders disparately (e.g., the first favoring women and the second favoring men, or vice versa).
To test this hypothesis, I looked at the proportional representation of each gender across the different levels of educational attainment (highest grade completed), number of jobs worked (job numbers), and industry of employment (industry). I also ran a multiple linear regression on the key variables, controlling for all other factors. This allowed me to estimate the magnitude of the association between each of these variables independently and whether it was statistically significant at the p<0.05 level. (For the final analysis, my examination of the relationship between income, gender, and jobs_number was excluded for reasons discussed elsewhere in this report.)
The second alternative hypothesis I chose to test is that the difference in income between men and women is due to family dynamics. According to this account, any differences in income observed between men and women are due to decisions individuals make cooperatively with some other member of their family unit. The most common example I would expect to see here is when two partners, whether explicitly or implicitly, adopt a strategy of distributing the essential functions of the family between themselves, with one taking a greater responsibility for the family’s finances (the so-called “breadwinner” role) and the other taking a greater responsibility for parenting duties and maintaining the home. I would expect this pattern to be most pervasive among respondents who are married with larger families. I would also expect this strategy to emerge in relationships in which one spouse earns a relatively high income (>$50k annually).
In order to test this hypothesis, I analyzed the proportional representation of each gender across the various categories of marital status, family size, and spouse’s income. I also ran a multiple linear regression on the key variables, controlling for all other factors. This allowed me to estimate the magnitude of the association between each of these variables independently and whether it was statistically significant at the p<0.05 level. (For the final analysis, my examination of the relationship between income, gender, and family_size was excluded for reasons discussed elsewhere in this report.)
With several variables I examined, I was unable to perform the analyses I initially intended, mostly due to insufficient or missing data. The main consequence of this difficulty was that I had to replace the more direct analysis I intended, which would have assessed the influence of the variable directly on the male-female wage gap, with something more roundabout. The approach I ultimately settled on was to break this analysis into two separate steps. In the first step, I calculated the relationship between the variable of interest and income, without special regard to gender. In the second step, I examined the proportional representation of the genders across the various levels of the variable, watching out for any imbalances that might distort the apparent relationship between that predictor and the outcome. While not ideal, this strategy allowed me to draw inferences as to whether the wage gap between men and women may be exaggerated by, or even entirely an artifact of, systematic differences in these other factors between the two genders. In short, it allowed me to assess the likelihood that these other factors were acting as confounders in the analysis of the relationship between gender and income.
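The two-step check described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data frame `toy`, its `level` column, and all of its values are hypothetical and do not come from the NLSY79 sample.

```r
# Hypothetical toy data; none of these values come from the NLSY79 sample.
toy <- data.frame(
  gender = factor(c("Male", "Male", "Male", "Female",
                    "Female", "Female", "Male", "Female")),
  level  = factor(c("A", "A", "B", "A", "B", "B", "B", "B")),
  income = c(40000, 50000, 70000, 30000, 45000, 55000, 65000, 50000)
)

# Step 1: relationship between the variable and income, ignoring gender.
income_by_level <- tapply(toy$income, toy$level, mean, na.rm = TRUE)

# Step 2: share of each gender within each level (each row sums to 1).
share_by_level <- prop.table(table(toy$level, toy$gender), margin = 1)

print(income_by_level)
print(round(share_by_level, 2))
```

If a level with high average income in step 1 also shows a lopsided gender share in step 2, that variable is a candidate confounder; if the shares are roughly balanced, it is a less likely explanation for the gap.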
In this section, I provide a careful presentation of my main findings concerning the problem of income inequality between men and women.
The first bar chart above shows that the highest paying industries, on average, are Finance and Insurance; Professional, Scientific, and Technical Services; Information; and Utilities, while the lowest paying industries are Management, Administrative Support, and Waste Management Services; Construction; and Accommodations and Food Services.
A side-by-side comparison of average income across industries by gender shows a slight-to-substantial advantage for men across most industries. This analysis suggests that the wage gap between men and women is not due to differences in choice of industry between the two groups, insofar as men tend to outearn women independently of what industry they’re in. The few exceptions are in the areas of Real Estate and Rental and Leasing and the Armed Forces, where women slightly outearn men on average.
Even in those areas where women are more strongly represented, such as Health Care and Social Assistance and Educational Services, men still tend to earn more on average (see next section’s analysis).
The bar chart above shows that the representation gap between men and women varies across industries, with women being more strongly represented in such industries as Health Care and Social Assistance and Educational Services and men being more strongly represented in Construction and Manufacturing. If those industries for which men were more strongly represented also tended to correspond to higher salaries on average, then this disparity might partially explain the wage gap we observe between men and women. If there is no such correlation, however, then this would not be a likely explanation for the gap we observe.
Referring back to the previous section’s analysis, we find no such correlation between high paying industries and representation. Men predominate in two out of the three lowest paying industries noted above, while women predominate in the highest paying industry, Finance and Insurance, are equally represented in the second highest paying industry, and are only slightly underrepresented in the remaining two highest-paying industries. While this analysis doesn’t eliminate the possibility that people’s choice of industry contributes to the wage gap we observe between men and women, it does somewhat weaken the case in favor of that explanation.
In the final section of my analysis of the relationship between income and industry, I compare the wage gap within each industry to further validate the results of the previous sections’ analyses.
## Warning: Removed 1 rows containing missing values (position_stack).
The bar chart above shows that the wage gap between men and women is far more pronounced in certain industries than in others, with most of these disparities favoring men. The widest disparities in earnings are in the areas of Finance and Insurance; Professional, Scientific and Technical Services; and Information, while the smallest disparities are in the areas of Construction, Real Estate and Rental and Leasing, and Armed Forces, with the latter two categories tending to favor women. In other words, in the industries in which women have an advantage, that advantage tends to be very modest, whereas the advantage is much larger in those industries that favor men. This analysis provides further evidence that occupational choice is likely not driving the difference we observe in the wages of men and women, although it is an important qualifier, for the reasons just mentioned.
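A within-industry wage gap of the kind plotted above can be computed as the male mean minus the female mean for each industry. The sketch below assumes a hypothetical toy data frame with made-up industries `X` and `Y`; it is not the code used to produce the chart.

```r
# Hypothetical toy data; industries "X" and "Y" are placeholders.
toy <- data.frame(
  gender   = c("Male", "Female", "Male", "Female", "Male", "Female"),
  industry = c("X", "X", "X", "Y", "Y", "Y"),
  income   = c(60000, 40000, 80000, 52000, 50000, 50000)
)

# Mean income for each industry-by-gender cell.
cell_means <- tapply(toy$income, list(toy$industry, toy$gender),
                     mean, na.rm = TRUE)

# Positive values favor men; negative values favor women.
wage_gap <- cell_means[, "Male"] - cell_means[, "Female"]
print(wage_gap)
```

In this toy example the gap in industry `X` strongly favors men, while the gap in industry `Y` slightly favors women, mirroring the asymmetry described in the text.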
Next, let’s look at the relationship between income and educational attainment (highest_grade).
## Warning: Ignoring 5662 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning: Removed 2 rows containing missing values (geom_bar).
In the boxplot provided above, you can see the generally positive effects of additional years of education on average income earned. The boxplot also allows us to see how additional years of education influence the range of incomes that become accessible to people in each class. Many of the higher income categories (e.g., above $100k/year) are reserved almost exclusively for those possessing at least a high school diploma (i.e., completed up to 12 years of education). Around the $350k mark, you can see the top-coded values of those earning significantly more than the bulk of the distribution for each class. We can therefore assume that the real average for these higher educational levels is actually somewhat higher than what is displayed, although such outcomes are rare.
The bar chart similarly shows a positive correlation between level of educational attainment and average income. With a few minor exceptions, average income tends to increase with every additional level of educational attainment, for both men and women. There are a few minor deviations from this trend among levels of grade school as well as college, but some of these differences likely fall within the margin of error for those measurements, so they should not be interpreted as significant. Among the major classes of educational attainment, e.g., from grade school to a bachelor’s degree, and between different levels of higher education, the difference is much more significant.
Notably, the positive effect of educational attainment on income is much more pronounced for men than women, a pattern that holds across virtually every category of education. The sole exception is for those with a 4th grade education, though again, this difference is likely within the margin of error for this category (n = 9), and therefore should not be interpreted as significant.
The two bar charts above show the male-to-female proportional representation per level of educational attainment and male - female difference in proportional representation per level of educational attainment, respectively. While male and female respondents were represented nearly equally across all categories, the slight differences that do exist are telling. Specifically, we find that men are more strongly represented among those who completed up to some high school (9th to 12th grade) and 8 or more years of college, while women are more strongly represented among those who completed up to some college (1 to 7 years). In other words, female respondents were on average better educated than men across the entire sample.
This analysis, like the previous one, provides evidence against the first alternative hypothesis, which proposed that women may be earning lower incomes because of lower educational attainment compared to men. In fact, what this analysis shows is that men are earning more despite having lower educational qualifications than their female counterparts, which is precisely the opposite of what this hypothesis predicted.
In the final section of my analysis of the relationship between income and professional qualifications, I compare the wage gap within each level of educational attainment to further validate the results of the previous sections’ analyses.
## Warning: Removed 2 rows containing missing values (position_stack).
The bar chart above shows that the wage gap between men and women increases as level of educational attainment increases, in favor of men. We see slight drops at irregular intervals, such as 5 years and 7 years of college, which might represent individuals who stopped short of completing a higher level degree, such as a master’s, doctorate, or professional degree. Alternatively, they may just reflect a small sample size - and therefore a larger margin of error - for these categories.
This analysis provides a first line of evidence that professional qualifications are likely not driving the difference we observe in the wages of men and women, insofar as men are benefiting more on average from the positive relationship between educational attainment and income, despite women having the stronger educational credentials on average. The last factor we’ll consider in our evaluation of the first alternative hypothesis is number of jobs, which is being used here as a proxy for professional experience.
## Warning: Ignoring 6177 observations
## Warning: Removed 5662 rows containing non-finite values (stat_ydensity).
The bar chart above shows the average income per category of marital status, separated by gender. You can see that men earn higher incomes on average than women across all categories of marital status. The difference is most pronounced among married individuals and least pronounced among those who have never been married. The second alternative hypothesis offers a possible explanation for this trend, which is that single individuals, whether male or female, are less likely to be burdened by the responsibilities of parenthood and therefore can devote more energy and attention to their careers, and perhaps even compete for more demanding, higher-paying jobs. In contrast, married individuals are more likely both to have children and to share incomes with their partners. Both of these factors (i.e., having larger families and sharing income with a spouse) would, according to this hypothesis, lead us to expect a decrease in the wages of women relative to men, as couples shift the burdens of parenthood disproportionately onto one partner to allow the remaining partner to fill the role of “breadwinner.” The analyses that follow will help evaluate whether the data support this explanation.
The boxplot similarly shows higher median incomes for married respondents compared to all other categories, with the upper fence value reaching significantly higher than those of the other categories. The violin plot shows how the population is distributed across the income range: the married, divorced, and widowed categories are the only ones with a somewhat even concentration of respondents in the higher income ranges, while all other categories taper off sharply as income increases.
The bar chart above shows that men are more strongly represented among those respondents that have never been married, while women predominate in every other category, albeit by only a slight (<3%) margin. Our second alternative hypothesis proposed that the wage gap observed between men and women might be partially explained on account of a larger proportion of women being married relative to men. This analysis provides very weak evidence for that hypothesis, since approximately 6% more male respondents were single (never married) and women were very slightly (<0.25%) more likely to be married. Likewise, a larger proportion of female respondents than male respondents were either separated or divorced. Insofar as these categories correlate positively with shared incomes and/or having children, our hypothesis would receive somewhat stronger evidence in its favor.
The bar chart above shows that there is a statistically significant difference in the incomes of men and women for the separated, divorced, and married categories, but no statistically significant difference for the never married and widowed categories. The most notable difference by a wide margin is for the married category, where the mean difference in income between men and women is $37,937.52, in favor of men.
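One way to check whether the gap within each marital status category is statistically significant is to run a two-sample t-test per category. The sketch below is an assumption about how such a check could be done, not a record of the original code; the data are simulated, and a gap is deliberately built into the hypothetical "Married" group only.

```r
set.seed(1)

# Simulated toy data: 20 men and 20 women in each of two categories.
toy <- data.frame(
  gender = factor(rep(c("Male", "Female"), times = 40),
                  levels = c("Male", "Female")),
  marital_status = rep(c("Married", "Never married"), each = 40)
)

# Build a 20,000 gap into the "Married" group only (an arbitrary choice).
toy$income <- 40000 + rnorm(80, sd = 5000) +
  ifelse(toy$gender == "Male" & toy$marital_status == "Married", 20000, 0)

# Welch t-test of income by gender within each category.
gap_tests <- lapply(split(toy, toy$marital_status), function(d) {
  tt <- t.test(income ~ gender, data = d)
  c(gap     = unname(tt$estimate[1] - tt$estimate[2]),  # Male minus Female
    lower   = tt$conf.int[1],
    upper   = tt$conf.int[2],
    p_value = tt$p.value)
})

gap_table <- do.call(rbind, gap_tests)
print(round(gap_table, 3))
```

A category whose confidence interval excludes zero (equivalently, whose p-value falls below 0.05) would be read as showing a statistically significant gap.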
However, since men and women are represented roughly equally within this category (refer to previous section’s analysis), this disparity does not help to explain why men tend to earn higher incomes than women.
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 9192 rows containing non-finite values (stat_smooth).
## Warning: Removed 9192 rows containing missing values (geom_point).
The scatter plots above show men and women’s income plotted against their spouses’ income. When a smoothed curve is added to represent how income varies with spouse’s income, we see markedly different trends for men and women. For men, personal income appears to be positively correlated with their spouse’s income at the higher ranges of the distribution for spouse’s income, while for women, the relationship is almost flat throughout the full range of the distribution. In other words, women’s income stays about the same on average regardless of how much their spouses earn, while men’s income appears to drop slightly as their spouse’s income increases up to about $30,000, but then increases steadily as their spouse’s income increases above $30,000.
This analysis does not provide evidence for the proposal made by our second alternative hypothesis, which proposed that women’s income may be lower than men’s in part because they are strategically distributing the caretaking and “breadwinning” responsibilities with their spouses. If that were happening to a significant extent, we would expect women’s wages to decrease slightly on average as their spouse’s income increases, as some women drop out of the work force to focus on parenting. Instead, what we find is that women’s incomes tend to increase along with their spouse’s income up to about x=$150,000 and decline only when their spouse’s income exceeds $150,000.
This interpretation is slightly more consistent with what we see happening in the case of men, however, at least for lower income couples - that is, men tend to have lower incomes as their spouse’s income increases up to about x=$40,000, possibly reflecting the effects of strategic income sharing to balance household responsibilities (although this is certainly not the only explanation for what we see). Among higher income couples, however, both partners’ incomes seem to increase together.
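To illustrate how a separate smoothed curve per gender can be obtained, the sketch below fits a base R `loess()` smoother within each gender. Note that this swaps in loess for the GAM smoother that `geom_smooth()` selected in the plots above, and the data are simulated with an arbitrary positive slope for men and a flat relationship for women.

```r
set.seed(2)
n <- 200

# Simulated toy data; the slopes below are arbitrary assumptions.
toy <- data.frame(
  gender        = rep(c("Male", "Female"), each = n / 2),
  spouse_income = runif(n, 0, 100000)
)
toy$income <- 30000 +
  0.2 * toy$spouse_income * (toy$gender == "Male") +
  rnorm(n, sd = 8000)

# Evaluate each smoother on a common grid of spouse incomes.
grid <- data.frame(spouse_income = seq(10000, 90000, by = 10000))

smooth_by_gender <- sapply(split(toy, toy$gender), function(d) {
  fit <- loess(income ~ spouse_income, data = d)
  predict(fit, newdata = grid)
})

print(round(smooth_by_gender))
```

Comparing the two columns row by row shows whether the fitted curve rises, falls, or stays flat for each gender as spouse’s income increases.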
In this section, I fit a linear regression model to the relationship between gender and my chosen key variables and interpret the model coefficients. As noted above, several variables included in my part 1 analysis, including jobs_number and family_size, were excluded from my final model due to weak association with the main variables of interest (gender and income) and/or difficulties with interpretability. It should also be noted that the analysis of this section relies on certain assumptions that will not be evaluated until the next section (part (b)), which will determine whether a linear regression is appropriate for modeling the relationship between income and these variables, and correspondingly, whether the standard interpretation of the coefficients is valid.
The first thing to note from the output summary above is that gender is a highly statistically significant predictor of income at a p-value of < 2e-16. Even holding industry, educational attainment, marital status, and spouse’s income constant, being female is associated with a -$38,549.58 difference in income compared to being male. Altogether, the statistically significant coefficient estimates in this model include (Intercept), genderFemale, industryConstruction, industryWholesale Trade, industryFinance and Insurance, industryProfessional, Scientific, and Technical Services, industryManagement, Administrative and Support, and Waste Management Services, industryEducational Services, industryArts, Entertainment, and Recreation, industryOther Services (Except Public Administration), highest_grade1st year college, highest_grade2nd year college, highest_grade3rd year college, highest_grade4th year college, highest_grade5th year college, highest_grade6th year college, highest_grade7th year college, and highest_grade8th year college or more. Below I provide an interpretation of a select few of these significant variables.
For the interpretations that follow, the baseline for comparison is a male who has never been married, has a high school education (has completed 12 years of education), works in the area of health care and social assistance, and has a spouse with an income of $0. This is, of course, merely a hypothetical scenario and doesn’t necessarily (or actually) represent any individual from our sample population. For ease of interpretation, all subsequent mentions of “holding all other variables constant” should be understood to connote this particular collection of features, save only for the facts that (a) the individual being compared against this baseline is female (and therefore carries a starting salary $38,549.58 lower than the male baseline) and (b) she differs in the one additional respect specified (i.e., that for which the coefficient is being interpreted).
Referring to the output summary above, we see that working in Educational Services is associated with a -$59,430.96 difference in income compared to working in Health Care and Social Assistance (the baseline industry), holding all other variables constant. Working in Finance and Insurance is associated with a -$16,102.61 difference in income compared to the baseline industry, holding all other variables constant. Working in Professional, Scientific and Technical Services is associated with a -$23,824.75 difference in income compared to the baseline industry, holding all other variables constant.
Similarly, with regard to level of educational attainment, we see that having completed 4 years of college is associated with a $3,141.26 difference in income compared to having completed 12 years of education (the baseline level), holding all other variables constant, while having completed 8 or more years of college is associated with a $59,643.50 difference in income compared to the baseline level, holding all other variables constant.
The R-squared for this model is 0.2821971, meaning that approximately 28.2% of the variability in income is explained by our predictors.
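The role of the baseline in reading these coefficients can be illustrated with a small simulated example. The sketch below is not the model fitted in this report: the data frame `toy`, its three industries, and the effect sizes are all assumptions chosen for illustration. It uses `relevel()` to fix the reference category, so that each factor coefficient is a difference from that reference, and then extracts the R-squared.

```r
set.seed(3)
n <- 300

# Simulated toy data with two categorical predictors.
toy <- data.frame(
  gender   = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  industry = factor(sample(c("Health Care", "Finance", "Education"),
                           n, replace = TRUE))
)

# relevel() sets the reference category each coefficient is compared against.
toy$gender   <- relevel(toy$gender, ref = "Male")
toy$industry <- relevel(toy$industry, ref = "Health Care")

# Arbitrary assumed effects: -15,000 for women, +10,000 for Finance.
toy$income <- 50000 -
  15000 * (toy$gender == "Female") +
  10000 * (toy$industry == "Finance") +
  rnorm(n, sd = 10000)

fit       <- lm(income ~ gender + industry, data = toy)
coefs     <- coef(fit)
r_squared <- summary(fit)$r.squared

print(round(coefs))
print(round(r_squared, 3))
```

Here `genderFemale` is read as the difference from men, and `industryFinance` as the difference from the reference industry, in each case with the other predictor held fixed. No coefficient is reported for the reference categories themselves because they are absorbed into the intercept.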
Next, I turn to an evaluation of the linear regression model to better assess the validity of the interpretations provided in this section.
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
In this section, I discuss whether the standard diagnostic plots indicate issues with a linear regression model for gender and my chosen key variables. The issues I’m looking for include such things as trends in residuals, variance issues, outliers, etc.
First of all, when we plot the distribution of residuals for this relationship, we see that both the linearity assumption (i.e., that the residuals look like random scatter around the zero line and there is no evidence of structure or pattern in the residuals) and the homoscedasticity assumption (i.e., that there is equal variance in the deviance of each y value from the fitted line) are violated, meaning the relationship cannot be appropriately modeled by a linear regression. Instead, the data have a tendency to concentrate below where the fitted line would predict them to appear, indicating that the mean of the observed values is consistently lower than what we would expect if the relationship were linear.
The assumption of homoscedasticity is somewhat more difficult to assess, but there appear to be slight variations in the deviance of y values at certain x values, specifically below x values of about 25,000. There also seems to be significant left to right fanning when we focus just on where the data are most densely concentrated, although this fanning is clearly constrained by the rigid bottom limit, and to a somewhat lesser degree, the upper limit as well (likely a consequence of our top-coded values). In light of this analysis, we can conclude that any linear regression model of this relationship is going to be significantly limited in its predictive power.
The Scale-Location plot provides a more fine-tuned tool for assessing the assumption of homoscedasticity, i.e., equal variance in the deviance of each y value from the fitted line. If this assumption were upheld, the red line running through the points would be approximately flat in the horizontal direction. However, that’s not what we see in our plot, indicating that we do not have equal variance in the deviation of our y values from the fitted line across all values of x. This analysis further supports our conclusion from above that a linear model of this relationship is not appropriate.
Next, we consider the Q-Q plot. Q-Q plots take the sample data, sort it in ascending order, and then plot it against quantiles calculated from a theoretical distribution (https://data.library.virginia.edu/understanding-q-q-plots/). The superimposed line represents where the data would be expected to fall if its underlying distribution were normal. The Q-Q plot above shows that the sample data does not conform to a normal distribution, as indicated by the sharp deviation of observed values above the 1.5 quantile values. There is also slight skewing at the lower end of the distribution, although this appears to be within an acceptable range.
Finally, the Residuals vs. Leverage plot allows us to identify influential data points in our model. The points we’re most concerned about are values in the upper right or lower right corners, which are outside the red dashed Cook’s distance line. These are points that would be influential in the model, possibly distorting our estimates. If such points were present, we’d want to consider removing them in order to get more accurate estimates from our model (https://medium.com/data-distilled/residual-plots-part-4-residuals-vs-leverage-plot-14aeed009ef7). It appears as though there may be one point (labeled 2681) that lies either on or just outside this boundary, suggesting our model might be improved by removing it. Because it’s on the cusp, however, I’ve decided to leave it.
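As a numeric complement to the Residuals vs. Leverage plot, Cook’s distance can be computed directly with `cooks.distance()`. The sketch below uses simulated data with one deliberately extreme point; the cutoff of 1 is a common rule of thumb that I am assuming here, not a threshold taken from this report.

```r
set.seed(4)

# Simulated data following a clean linear trend.
x <- runif(100, 0, 10)
y <- 2 * x + rnorm(100)

# Replace the last observation with a deliberately extreme,
# high-leverage point that sits far from the trend.
x[100] <- 30
y[100] <- 0

fit   <- lm(y ~ x)
cooks <- cooks.distance(fit)

# Flag points above a rule-of-thumb cutoff of 1 (an assumption).
influential <- which(cooks > 1)
print(influential)
```

Calling `plot(fit)` on the same model produces the four standard diagnostic plots discussed in this section, including the Residuals vs. Leverage panel with its Cook’s distance contours.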
In this section, I use the anova() function to assess whether each of the variables included in my model is a statistically significant predictor of income by comparing my model from part (a) to a model where each of these variables is excluded.
## Analysis of Variance Table
##
## Model 1: income ~ gender + industry + highest_grade + marital_status +
## spouse_income
## Model 2: income ~ gender + highest_grade + marital_status + spouse_income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3067 9880161250758
## 2 3087 10337872794679 -20 -457711543921 7.1041 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: income ~ gender + industry + highest_grade + marital_status +
## spouse_income
## Model 2: income ~ gender + industry + marital_status + spouse_income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3067 9880161250758
## 2 3085 11647196744774 -18 -1767035494015 30.474 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: income ~ gender + industry + highest_grade + marital_status +
## spouse_income
## Model 2: income ~ gender + industry + highest_grade + spouse_income
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3067 9880161250758
## 2 3071 9907054362983 -4 -26893112225 2.087 0.0799 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: income ~ gender + industry + highest_grade + marital_status +
## spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3067 9880161250758
## 2 3068 9882437635679 -1 -2276384921 0.7066 0.4006
First, I apply the anova() function to the industry variable. From this analysis, we see that industry is highly statistically significant at a p-value of $6.8831159 \times 10^{-20}$. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that average income is the same across all industries once the other predictors are held constant. In other words, the data suggest that income varies with the industry in which one works.
When I apply the anova() function to the highest_grade variable, we see that it, too, is highly statistically significant at a p-value of $5.905078 \times 10^{-96}$. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that average income is the same across all levels of educational attainment once the other predictors are held constant. In other words, the data suggest that income varies with level of education.
When I apply the anova() function to the marital_status variable, however, we find that it is not statistically significant at the 0.05 level, having a p-value of 0.0799031. If a linear regression were appropriate for modeling this relationship, we would therefore not be able to reject the null hypothesis that average income is the same across all categories of marital status once the other predictors are held constant. In other words, we do not have evidence that income varies with a person’s marital status after accounting for the other variables in the model.
Finally, when I apply the anova() function to the spouse_income variable, we see that it, too, is not statistically significant, having a p-value of 0.4006285. If a linear regression were appropriate for modeling this relationship, we would not be able to reject the null hypothesis that spouse’s income has no association with income once the other predictors are held constant. In other words, the data do not indicate that income varies with spouse’s income after accounting for the other variables in the model.
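The nested-model comparisons above can be sketched on simulated data as follows. The variable `noise_var` and all effect sizes are hypothetical; the point is only to show the mechanics of dropping one term with `update()` and comparing the two fits with `anova()`. The reduced model is listed first here, which flips the sign of the Df column relative to the tables above but leaves the F-test unchanged.

```r
set.seed(5)
n <- 400

# Simulated toy data: industry matters, noise_var does not (by construction).
toy <- data.frame(
  gender    = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  industry  = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  noise_var = rnorm(n)
)
toy$income <- 50000 -
  15000 * (toy$gender == "Female") +
  12000 * (toy$industry == "B") +
  rnorm(n, sd = 10000)

full        <- lm(income ~ gender + industry + noise_var, data = toy)
no_industry <- update(full, . ~ . - industry)   # drop industry only
no_noise    <- update(full, . ~ . - noise_var)  # drop noise_var only

# F-test p-value for each dropped term (second row of the anova table).
p_industry <- anova(no_industry, full)[["Pr(>F)"]][2]
p_noise    <- anova(no_noise, full)[["Pr(>F)"]][2]

print(c(industry = p_industry, noise_var = p_noise))
```

A small p-value means the full model fits significantly better than the model without that term, i.e., the term is a significant predictor of income given the others.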
## Analysis of Variance Table
##
## Model 1: income ~ gender + industry + highest_grade + marital_status +
## spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status +
## spouse_income + gender:industry
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3067 9880161250758
## 2 3047 9760960253620 20 119200997138 1.8605 0.0114 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: income ~ gender + industry + highest_grade + marital_status +
## spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status +
## spouse_income + gender:highest_grade
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3067 9880161250758
## 2 3052 9618402780284 15 261758470474 5.5372 2.852e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: income ~ gender + industry + highest_grade + marital_status +
## spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status +
## spouse_income + gender:marital_status
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3067 9880161250758
## 2 3063 9848920220902 4 31241029856 2.429 0.04573 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this study, we are exploring the question of whether there are any factors that exacerbate or mitigate the income gap between men and women. This is different from asking whether there are factors that affect income. In the latter case, we are only estimating the so-called main effects of each variable, while in the former case, we are measuring the interaction effects, i.e., the effect that emerges when two variables appear together. The specific interaction effects that we’re interested in are those combining gender and one of our other chosen variables. By looking at the p-values for the interaction terms, we can answer the question of whether the income gap differs across different levels of these key variables (http://www.andrew.cmu.edu/user/achoulde/94842/lectures/lecture11/lecture11-94842.html).
The first interaction term (gender:industry) is statistically significant, with a p-value of 0.0114029. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that the income gap is the same across all industries. In other words, the data suggest that the income gap between men and women varies with industry.
The second interaction term (gender:highest_grade) is statistically significant, with a p-value of $2.8523434 \times 10^{-11}$. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that the income gap is the same across all levels of educational attainment. In other words, the data suggest that the income gap between men and women varies with educational attainment.
The third interaction term (gender:marital_status) is also statistically significant, with a p-value of 0.0457257. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that the income gap is the same across all levels of marital status. In other words, the data suggest that the income gap between men and women varies with marital status.
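The nested-model comparisons reported above can be illustrated with a minimal sketch. It runs on simulated data rather than the NLSY79 sample: the data frame `sim`, the factor `educ`, and the dollar amounts are all assumptions made for illustration, and the simulation deliberately builds in a gender gap that widens with education. The point is only to show how a main-effects model and an interaction model might be compared with `anova()`.

```r
# Illustrative only: simulated data, not the NLSY79 sample.
set.seed(42)
n <- 2000
sim <- data.frame(
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  educ   = factor(sample(c("HS", "College", "Graduate"), n, replace = TRUE),
                  levels = c("HS", "College", "Graduate"))
)

# Build income so that the male-female gap widens with education
# (an assumption baked into the simulation).
base <- unname(c(HS = 30000, College = 45000, Graduate = 60000)[as.character(sim$educ)])
gap  <- unname(c(HS = 5000, College = 12000, Graduate = 20000)[as.character(sim$educ)])
sim$income <- base + ifelse(sim$gender == "Male", gap, 0) + rnorm(n, sd = 15000)

# Main effects only: a single gender gap shared by every education level.
fit.main <- lm(income ~ gender + educ, data = sim)

# Add the interaction: the gap is allowed to differ by education level.
fit.int <- update(fit.main, . ~ . + gender:educ)

# F-test of the nested models; the null hypothesis is that every
# interaction coefficient is zero (the gap is the same at all levels).
cmp <- anova(fit.main, fit.int)
print(cmp)
p.interaction <- cmp[["Pr(>F)"]][2]
```

A small `p.interaction` here plays the same role as the p-values quoted above: it is evidence against a constant gap, subject to the same caveat that the linear model must be appropriate for the test to be trusted.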
In this section, I summarize my main conclusions and discuss potential limitations of my analysis and findings, beginning with potential confounders.
As noted throughout this report, there are a number of potential confounders that I may not have accounted for in my analysis and that limit the validity of my final analysis. The first of these, which I’ve already described in some detail, is the missing values. If these non-responses aren’t random, i.e., if they aren’t balanced across the various segments of our sample population, then they may introduce bias into our model by obscuring important differences in the non-responding portions of the population. These differences may have an impact both on the assignment of treatment (for those predictors that individuals have control over) and on the outcome of interest (income), thereby confounding our results.
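One way to probe whether non-response is balanced across groups is to compare missingness rates directly. The sketch below is a hypothetical illustration on simulated data, not something run on the NLSY79 sample; the data frame `sim` and the missingness probabilities are assumptions chosen so that one group is missing more often. It can only detect missingness that is related to an observed variable (here, gender), not missingness that depends on the unobserved income values themselves.

```r
# Illustrative only: simulated data, not the NLSY79 sample.
set.seed(1)
n <- 1000
sim <- data.frame(
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  income = rlnorm(n, meanlog = 10.5, sdlog = 0.6)
)

# Make income missing more often for one group (an assumption of the simulation).
p.miss <- ifelse(sim$gender == "Female", 0.30, 0.15)
sim$income[runif(n) < p.miss] <- NA

# Share of missing income values within each gender.
miss.rate <- tapply(is.na(sim$income), sim$gender, mean)
print(round(miss.rate, 3))

# Chi-squared test of independence between gender and missingness.
miss.test <- chisq.test(table(sim$gender, is.na(sim$income)))
print(miss.test$p.value)
```

If the rates differ markedly, as they do by construction here, the complete-case sample is no longer representative with respect to that variable, which is the concern raised above.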
Another obvious source of possible confounders in my model is the many variables from the original data set that I chose to omit. While omitting these variables reduces the risk of collinearity and yields a simpler, lower-variance model, it also excludes many factors we might like to control for. Specifically in relation to my second alternative hypothesis, which speculated about influences emanating from one’s family unit, it may have been helpful to include those variables that relate to respondents’ attitudes about gender, as well as those variables that would intuitively influence those attitudes, such as one’s religious beliefs and perhaps region (e.g., whether the respondent is from the more conservative South, or a rural rather than urban environment, etc.). Such factors could easily operate as confounders in the context of that analysis.
Similarly, in the context of my first alternative hypothesis, which proposed that the wage gap between men and women might be explainable on account of systematic differences in men and women’s professional qualifications and/or occupational choices, it might have been informative to include the variable coding respondents’ occupations (occupation) rather than simply industry (industry). Almost all industries feature a prominent hierarchical structure that may parse male and female workers more neatly than the different industries themselves. Likewise, one’s position within this hierarchy may be more strongly predictive of one’s income than one’s choice of industry itself, in which case industry would be largely operating as a red herring in our analysis. Other variables I didn’t look at, but which could be acting as confounders in relation to this analysis, are those coding whether the respondent had a criminal record or a history of drug abuse, both of which would be expected to correlate negatively with one’s job prospects and income. Based on what we know about the relationship between gender and criminal behavior (i.e., that men are much more likely to have a criminal record than women), however, I would not expect the inclusion of these variables to mitigate the wage gap we observe between men and women. If anything, the failure to control for them may actually be suppressing the true extent of men’s advantage over women in income.
Next, I address the issue of plausibility of the models presented in my final analysis.
All in all, the models presented in my final analysis told a consistent story of wage discrimination against women. The wage gap observed between men and women persisted across all other parameters we examined, including choice of industry, level of education, number of jobs, marital status, family size, and spouse’s income, providing strong evidence that it is, in fact, gender that is responsible for the differences we observe. At least to this extent, I believe my models to be telling an accurate story, and I find the results perfectly plausible.
The perhaps more interesting question is whether these other factors may be serving to mitigate or exacerbate the effect of gender on income. Unfortunately, this is where things become much murkier. First of all, the diagnostic plots indicate that a linear regression is not appropriate for modeling the relationship between income and my other chosen variables. This effectively undermines the plausibility of my linear models from the outset.
Unfortunately, the tabular and graphical summaries are much less useful for drawing the sorts of conclusions I’m interested in drawing, and are much more open to alternative interpretations. This is not to repudiate their plausibility, which I feel is reasonably strong; but it does mean I’m more limited in what I’m able to conclude on the basis of those models alone. I think the strongest case is made by taking all the tabular and graphical summaries together, and finding the explanation that consistently fits them all. However, I don’t think I’ve accumulated a sufficient number of consistent models to feel like I’m able to rule out the many other alternative narratives that might fit them equally well. In the final analysis, then, while I feel the tabular and graphical summaries are certainly plausible, I don’t think they constitute decisive evidence in support of any particular conclusion.
Before closing, I’ll consider the question of how much confidence I have in my analysis, i.e., whether I believe my conclusions and would feel confident presenting them to policy makers.
While I feel reasonably confident in the broad strokes of my analysis (e.g., that there is a wage gap between men and women, that it is not fully explained by the influence of other, non-gender-related factors, and that there are significant interaction effects between gender and the other factors examined), I am significantly less confident in my estimates of specific coefficients, especially in light of what was learned from my diagnostic plots concerning the appropriateness of a linear regression to model the relationship between income and my other chosen variables. If my assessment of those plots is accurate, then we cannot rely on the standard interpretation of model coefficients or any analysis that is based on them.
In my assessment, the main utility of this analysis is to illuminate those patterns and trends that most directly bear on the question of whether there is a gender wage gap, to help us readjust or fine-tune our expectations according to what those patterns reveal, and to help us ask better, more well-targeted questions moving forward. Where I would encourage policy-makers to direct their immediate attention and effort is in identifying a more appropriate model for mapping the relationship between gender and income, given what we now know about the limitations of a straightforward linear model. Doing so should allow us to get beyond the sort of loose speculation I’ve engaged in throughout this report and closer to the sorts of causal claims we’re ultimately interested in making.